Merge lp:~ian-clatworthy/bzr/faster-diff2 into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy on 2009-05-27
Status: Merged
Approved by: Ian Clatworthy on 2009-05-28
Approved revision: 4381
Merged at revision: not available
Proposed branch: lp:~ian-clatworthy/bzr/faster-diff2
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 73 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-diff2
Reviewer Review Type Date Requested Status
Martin Pool 2009-05-27 Disapprove on 2009-05-27
Review via email: mp+6813@code.launchpad.net
To post a comment you must log in.
Ian Clatworthy (ian-clatworthy) wrote :

Some more tweaks for speeding up diff. One tweak stops multiple lookups of the dirstate entry in DirStateRevisionTree while grabbing info (e.g. mtime, contents) about a file-id. The other skips an id2path conversion when the path is already known.

On OOo, these take the time down from 2.6 to 1.7 seconds for 'bzr diff' and from 1.5 to 1.0 seconds for 'bzr diff file'.

Martin Pool (mbp) wrote :

I'm really not sure about the cache part of this:

Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster. Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level. (I wonder what John thinks.) Could you store it at the diff level?

review: Disapprove
Ian Clatworthy (ian-clatworthy) wrote :

Martin Pool wrote:
> Review: Disapprove
> I'm really not sure about the cache part of this:
>
> Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster. Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level. (I wonder what John thinks.) Could you store it at the diff level?
>
So, there are 2 issues here. I'll respond to the simple one first.

FWIW, I did initially try using a tuple as the cache key but different
parts of the code use different tuples to look up the same entry, e.g.
('foo-file-id', None) and ('foo-file-id', 'foo') are actually the same
thing. I explicitly append "path:" to path keys so that it can only
break in the very obscure circumstance of:

1. a file-id starting with that prefix
2. the rest of the file-id matching a path in the tree
3. a path being passed to a method without a file-id

So it's possible but about as likely as me winning the lottery. :-) One
simple fix is to only cache if the file-id is set and just use it. Or we
can prefix path keys with a character not permitted in file-ids. Of
course, if we're not caching at the tree level then it doesn't matter.

Cache consistency on dirstate mutation is a far more serious problem.
I'm only caching in DirStateRevisionTree, not WorkingTree, so I (perhaps
naively) felt it couldn't change. Or that the change was easily detected?

I'm not overly keen to solve cache consistency at the diff level. The
code there is quite clean simply doing stuff like tree.get_mtime() and
tree.get_file_text(). I don't think we want diff code knowing about or
caring about dirstate entries. We could introduce a compound method
though that hands back both the mtime and file-text at once. diff.py
would need a fair amount of refactoring to use it though because the two
bits of data are looked up quite separately currently. Another thing to
keep in mind is that something like 'get_file_text_and_mtime()' only
helps diff while a lower level cache potentially helps many operations.

Another option is to move the cache down into dirstate, where it can be
easily cleared when a change happens, particularly if my
"faster-dirstate-saving" branch lands with the refactoring I've done in
there. We seem to have a pretty common pattern of:

  1. call iter changes to find what's different
  2. call get_entry to lookup the entry/s for each change

iter-changes is *really* fast while get-entry on large trees isn't
(because it bisects every time). So maybe iter_changes ought to keep a
cache of changed entries it finds and get_entry could check that first?
If I'm right, the combination ought to scream along, speeding up diff
and perhaps several other operations too. The downside is that I hate
changing dirstate :-) and that the change might require modifications to
both python and pyrex code. But it sounds a better direction in theory.

How would you like me to proceed?

Martin Pool (mbp) wrote :
Download full text (4.7 KiB)

2009/5/28 Ian Clatworthy <email address hidden>:
> Martin Pool wrote:
>> Review: Disapprove
>> I'm really not sure about the cache part of this:
>>
>> Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster.  Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level.  (I wonder what John thinks.)  Could you store it at the diff level?
>>
> So, there are 2 issues here. I'll respond to the simple one first.
>
> FWIW, I did initially try using a tuple as the cache key but different
> parts of the code use different tuples to look up the same entry, e.g.
> ('foo-file-id', None) and ('foo-file-id', 'foo') are actually the same
> thing. I explicitly append "path:" to path keys so that it can only
> break in the very obscure circumstance of:
>
> 1. a file-id starting with that prefix
> 2. the rest of the file-id matching a path in the tree
> 3. a path being passed to a method without a file-id
>
> So it's possible but about as likely as me winning the lottery. :-) One
> simple fix is to only cache if the file-id is set and just use it. Or we
> can prefix path keys with a character not permitted in file-ids. Of
> course, if we're not caching at the tree level then it doesn't matter.

It's just that if there are other ways to spell it, it seems a shame
to have the risk at all, and for that matter to have new strings in
use that are different from what's used elsewhere.

>
> Cache consistency on dirstate mutation is a far more serious problem.
> I'm only caching in DirStateRevisionTree, not WorkingTree, so I (perhaps
> naively) felt it couldn't change. Or that the change was easily detected?

I haven't strictly checked if this will break your cache, but the
dirstate object can hold different basis trees after a commit or
update. (set_parent_trees or something.) I can't prove it will fail
but it certainly seems risky, or likely to break later on.

>
> I'm not overly keen to solve cache consistency at the diff level. The
> code there is quite clean simply doing stuff like tree.get_mtime() and
> tree.get_file_text(). I don't think we want diff code knowing about or
> caring about dirstate entries. We could introduce a compound method
> though that hands back both the mtime and file-text at once. diff.py
> would need a fair amount of refactoring to use it though because the two
> bits of data are looked up quite separately currently. Another thing to
> keep in mind is that something like 'get_file_text_and_mtime()' only
> helps diff while a lower level cache potentially helps many operations.

That sounds a lot more attractive to me; it's just better all round to
do fewer calls and get all you want, rather than more calls. Aside
from avoiding possible bugs in the cache, it will be faster when the
cache misses, and faster even when the cache hits because of less
function calls.

>
> Another option is to move the cache down into dirstate, where it can be
> easily cleared when a change happens, particularly if my
> "f...

Read more...

John A Meinel (jameinel) wrote :
Download full text (5.1 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> iter-changes is *really* fast while get-entry on large trees isn't
>> (because it bisects every time). So maybe iter_changes ought to keep a
>> cache of changed entries it finds and get_entry could check that first?
>> If I'm right, the combination ought to scream along, speeding up diff
>> and perhaps several other operations too. The downside is that I hate
>> changing dirstate :-) and that the change might require modifications to
>> both python and pyrex code. But it sounds a better direction in theory.
>>
>> How would you like me to proceed?
>
> I believe the point of iter_changes was that it'd give you back all
> you need, and we'd prefer not to do more round trips afterwards.
> Ideally it would be a bit abstracted away from the details of dirstate
> representation, but still tuples that are fast to construct. If there
> are some items that would be slow to compute and not always needed you
> could possibly pass back a (curried?) callable.
>
> What does diff need to get from the tree that's not returned by
> iter_changes? The text and the mtime? Really neither of them ought
> to need to look at the dirstate, if they have the file id and path, as
> far as I can see. The mtime just comes from either the working copy
> or the revision, and the text too can be determined by either the path
> or the file-id.
>

For a while now I've been of the mind that 'iter_changes' should be
returning real objects, rather than tuples. IMO the *important* bit is
that you need to avoid doing O(tree) objects for a primarily O(changes)
operation.

iter_changes() can be "O(tree)" but only when...
1) include_unchanged=True
2) initial commit

Under both cases, the action being *performed* is O(tree), so
iter_changes isn't creating extra overhead.

If I was to personally give a direction for this, it would be to create
a new iter_changes-like operation, which returns real objects with
attributes back to the calling code. (An initial skin could just wrap
iter_changes() into real objects.)
And along with those objects, you could then add "get me more
information", which could include stuff like caching in the dirblock
entry, to quickly return extra info.

If you feel the overhead is in 'bisect' time, then we should look at
where that bisect is being done. I would much rather have a 'location
hint' cache at the dirstate level, which would seed the bisect search.
So when iter_changes finds something interesting about ('foo', 'bar',
'baz') it can then set the cache hint, so that a lookup of ('foo',
'bar') knows right where to find the object. I would also design it so
that it grabs that entry, checks if it is correct. If not, it
invalidates the entry (possibly the whole cache?) and then does the
normal bisect.

This would also have implications for multiple '_get_entry' calls, since
if you just grabbed it for X.rev, it should be extremely fast to grab it
for X.mtime.

Note, I would use something like an LRU cache or even FIFO cache here. I
guess you did somethin along the same lines with the
'_last_entry[_lookup]' functions, though I think I might put it under
the _get_entry() call, rather than above it.

...

Read more...

Martin Pool (mbp) wrote :

2009/5/28 John A Meinel <email address hidden>:
> If I was to personally give a direction for this, it would be to create
> a new iter_changes-like operation, which returns real objects with
> attributes back to the calling code. (An initial skin could just wrap
> iter_changes() into real objects.)
> And along with those objects, you could then add "get me more
> information", which could include stuff like caching in the dirblock
> entry, to quickly return extra info.

I agree.

>
> If you feel the overhead is in 'bisect' time, then we should look at
> where that bisect is being done. I would much rather have a 'location
> hint' cache at the dirstate level, which would seed the bisect search.
> So when iter_changes finds something interesting about ('foo', 'bar',
> 'baz') it can then set the cache hint, so that a lookup of ('foo',
> 'bar') knows right where to find the object. I would also design it so
> that it grabs that entry, checks if it is correct. If not, it
> invalidates the entry (possibly the whole cache?) and then does the
> normal bisect.

I agree with that too, and the 4-point plan.

--
Martin <http://launchpad.net/~mbp/>

Martin Pool (mbp) wrote :

Also, if it works in isolation, can I suggest that you merge the first
part of this patch without the cache?

Ian Clatworthy (ian-clatworthy) wrote :

Martin Pool wrote:
>> If you feel the overhead is in 'bisect' time, then we should look at
>> where that bisect is being done. I would much rather have a 'location
>> hint' cache at the dirstate level, which would seed the bisect search.
>> So when iter_changes finds something interesting about ('foo', 'bar',
>> 'baz') it can then set the cache hint, so that a lookup of ('foo',
>> 'bar') knows right where to find the object. I would also design it so
>> that it grabs that entry, checks if it is correct. If not, it
>> invalidates the entry (possibly the whole cache?) and then does the
>> normal bisect.
>>
>
> I agree with that too, and the 4-point plan.
>
Me too. I think John's approach rocks.

Here's what I'd like to do:

1. back out the cache at the tree level
2. land the rest of this patch - the non-contentious bit
3. implement John's suggestion in a separate branch.

Sound ok?

Ian C.

Martin Pool (mbp) wrote :

2009/5/28 Ian Clatworthy <email address hidden>:
> 1. back out the cache at the tree level
> 2. land the rest of this patch - the non-contentious bit
> 3. implement John's suggestion in a separate branch.

Great.

--
Martin <http://launchpad.net/~mbp/>

lp:~ian-clatworthy/bzr/faster-diff2 updated on 2009-05-28
4381. By Ian Clatworthy on 2009-05-28

back out tree-level cache of dirstate entries as agreed during review

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2009-05-28 17:14:38 +0000
3+++ NEWS 2009-05-28 18:35:44 +0000
4@@ -23,6 +23,8 @@
5
6 * ``bzr branch --notree`` is now faster. (Ian Clatworthy)
7
8+* ``bzr diff`` is now faster on large trees. (Ian Clatworthy)
9+
10 Bug Fixes
11 *********
12
13
14=== modified file 'bzrlib/diff.py'
15--- bzrlib/diff.py 2009-03-28 02:10:54 +0000
16+++ bzrlib/diff.py 2009-05-28 18:35:44 +0000
17@@ -620,9 +620,11 @@
18 return self.CANNOT_DIFF
19 from_label = '%s%s\t%s' % (self.old_label, old_path, old_date)
20 to_label = '%s%s\t%s' % (self.new_label, new_path, new_date)
21- return self.diff_text(from_file_id, to_file_id, from_label, to_label)
22+ return self.diff_text(from_file_id, to_file_id, from_label, to_label,
23+ old_path, new_path)
24
25- def diff_text(self, from_file_id, to_file_id, from_label, to_label):
26+ def diff_text(self, from_file_id, to_file_id, from_label, to_label,
27+ from_path=None, to_path=None):
28 """Diff the content of given files in two trees
29
30 :param from_file_id: The id of the file in the from tree. If None,
31@@ -630,15 +632,17 @@
32 :param to_file_id: The id of the file in the to tree. This may refer
33 to a different file from from_file_id. If None,
34 the file is not present in the to tree.
35+ :param from_path: The path in the from tree or None if unknown.
36+ :param to_path: The path in the to tree or None if unknown.
37 """
38- def _get_text(tree, file_id):
39+ def _get_text(tree, file_id, path):
40 if file_id is not None:
41- return tree.get_file(file_id).readlines()
42+ return tree.get_file(file_id, path).readlines()
43 else:
44 return []
45 try:
46- from_text = _get_text(self.old_tree, from_file_id)
47- to_text = _get_text(self.new_tree, to_file_id)
48+ from_text = _get_text(self.old_tree, from_file_id, from_path)
49+ to_text = _get_text(self.new_tree, to_file_id, to_path)
50 self.text_differ(from_label, from_text, to_label, to_text,
51 self.to_file)
52 except errors.BinaryFile:
53@@ -882,7 +886,7 @@
54 self.to_file.write("=== modified %s '%s'%s\n" % (kind[0],
55 newpath_encoded, prop_str))
56 if changed_content:
57- self.diff(file_id, oldpath, newpath)
58+ self._diff(file_id, oldpath, newpath, kind[0], kind[1])
59 has_changes = 1
60 if renamed:
61 has_changes = 1
62@@ -903,7 +907,10 @@
63 new_kind = self.new_tree.kind(file_id)
64 except (errors.NoSuchId, errors.NoSuchFile):
65 new_kind = None
66-
67+ self._diff(file_id, old_path, new_path, old_kind, new_kind)
68+
69+
70+ def _diff(self, file_id, old_path, new_path, old_kind, new_kind):
71 result = DiffPath._diff_many(self.differs, file_id, old_path,
72 new_path, old_kind, new_kind)
73 if result is DiffPath.CANNOT_DIFF: