Code review comment for lp:~ian-clatworthy/bzr/faster-diff2

Martin Pool (mbp) wrote :

2009/5/28 Ian Clatworthy <email address hidden>:
> Martin Pool wrote:
>> Review: Disapprove
>> I'm really not sure about the cache part of this:
>>
>> Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster.  Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level.  (I wonder what John thinks.)  Could you store it at the diff level?
>>
> So, there are 2 issues here. I'll respond to the simple one first.
>
> FWIW, I did initially try using a tuple as the cache key but different
> parts of the code use different tuples to look up the same entry, e.g.
> ('foo-file-id', None) and ('foo-file-id', 'foo') are actually the same
> thing. I explicitly append "path:" to path keys so that it can only
> break in the very obscure circumstance of:
>
> 1. a file-id starting with that prefix
> 2. the rest of the file-id matching a path in the tree
> 3. a path being passed to a method without a file-id
>
> So it's possible but about as likely as me winning the lottery. :-) One
> simple fix is to only cache if the file-id is set and just use it. Or we
> can prefix path keys with a character not permitted in file-ids. Of
> course, if we're not caching at the tree level then it doesn't matter.

It's just that if there are other ways to spell it, it seems a shame
to have the risk at all, and for that matter to have new strings in
use that are different from what's used elsewhere.

>
> Cache consistency on dirstate mutation is a far more serious problem.
> I'm only caching in DirStateRevisionTree, not WorkingTree, so I (perhaps
> naively) felt it couldn't change. Or that the change was easily detected?

I haven't strictly checked if this will break your cache, but the
dirstate object can hold different basis trees after a commit or
update. (set_parent_trees or something.) I can't prove it will fail
but it certainly seems risky, or likely to break later on.

>
> I'm not overly keen to solve cache consistency at the diff level. The
> code there is quite clean simply doing stuff like tree.get_mtime() and
> tree.get_file_text(). I don't think we want diff code knowing about or
> caring about dirstate entries. We could introduce a compound method
> though that hands back both the mtime and file-text at once. diff.py
> would need a fair amount of refactoring to use it though because the two
> bits of data are looked up quite separately currently. Another thing to
> keep in mind is that something like 'get_file_text_and_mtime()' only
> helps diff while a lower level cache potentially helps many operations.

That sounds a lot more attractive to me; it's just better all round to
do fewer calls and get all you want, rather than more calls. Aside
from avoiding possible bugs in the cache, it will be faster when the
cache misses, and faster even when the cache hits because of less
function calls.

>
> Another option is to move the cache down into dirstate, where it can be
> easily cleared when a change happens, particularly if my
> "faster-dirstate-saving" branch lands with the refactoring I've done in
> there. We seem to have a pretty common pattern of:
>
>  1. call iter changes to find what's different
>  2. call get_entry to lookup the entry/s for each change
>
> iter-changes is *really* fast while get-entry on large trees isn't
> (because it bisects every time). So maybe iter_changes ought to keep a
> cache of changed entries it finds and get_entry could check that first?
> If I'm right, the combination ought to scream along, speeding up diff
> and perhaps several other operations too. The downside is that I hate
> changing dirstate :-) and that the change might require modifications to
> both python and pyrex code. But it sounds a better direction in theory.
>
> How would you like me to proceed?

I believe the point of iter_changes was that it'd give you back all
you need, and we'd prefer not to do more round trips afterwards.
Ideally it would be a bit abstracted away from the details of dirstate
representation, but still tuples that are fast to construct. If there
are some items that would be slow to compute and not always needed you
could possibly pass back a (curried?) callable.

What does diff need to get from the tree that's not returned by
iter_changes? The text and the mtime? Really neither of them ought
to need to look at the dirstate, if they have the file id and path, as
far as I can see. The mtime just comes from either the working copy
or the revision, and the text too can be determined by either the path
or the file-id.

--
Martin <http://launchpad.net/~mbp/>

« Back to merge proposal