Code review comment for lp:~ian-clatworthy/bzr/faster-diff2

Ian Clatworthy (ian-clatworthy) wrote :

Martin Pool wrote:
> Review: Disapprove
> I'm really not sure about the cache part of this:
>
> Folding both file_id and path_id into a single string is potentially (in obscure cases) buggy, and at least seems unnecessary; using a tuple may even be faster. Also I'm seriously concerned that the cache is not invalidated in the dirstate is mutated, and would question whether it's a good idea to put a cache of unlimited duration in at this level. (I wonder what John thinks.) Could you store it at the diff level?
>
So, there are 2 issues here. I'll respond to the simple one first.

FWIW, I did initially try using a tuple as the cache key but different
parts of the code use different tuples to look up the same entry, e.g.
('foo-file-id', None) and ('foo-file-id', 'foo') are actually the same
thing. I explicitly append "path:" to path keys so that it can only
break in the very obscure circumstance of:

1. a file-id starting with that prefix
2. the rest of the file-id matching a path in the tree
3. a path being passed to a method without a file-id

So it's possible but about as likely as me winning the lottery. :-) One
simple fix is to only cache if the file-id is set and just use it. Or we
can prefix path keys with a character not permitted in file-ids. Of
course, if we're not caching at the tree level then it doesn't matter.

Cache consistency on dirstate mutation is a far more serious problem.
I'm only caching in DirStateRevisionTree, not WorkingTree, so I (perhaps
naively) felt it couldn't change. Or that the change was easily detected?

I'm not overly keen to solve cache consistency at the diff level. The
code there is quite clean simply doing stuff like tree.get_mtime() and
tree.get_file_text(). I don't think we want diff code knowing about or
caring about dirstate entries. We could introduce a compound method
though that hands back both the mtime and file-text at once. diff.py
would need a fair amount of refactoring to use it though because the two
bits of data are looked up quite separately currently. Another thing to
keep in mind is that something like 'get_file_text_and_mtime()' only
helps diff while a lower level cache potentially helps many operations.

Another option is to move the cache down into dirstate, where it can be
easily cleared when a change happens, particularly if my
"faster-dirstate-saving" branch lands with the refactoring I've done in
there. We seem to have a pretty common pattern of:

  1. call iter changes to find what's different
  2. call get_entry to lookup the entry/s for each change

iter-changes is *really* fast while get-entry on large trees isn't
(because it bisects every time). So maybe iter_changes ought to keep a
cache of changed entries it finds and get_entry could check that first?
If I'm right, the combination ought to scream along, speeding up diff
and perhaps several other operations too. The downside is that I hate
changing dirstate :-) and that the change might require modifications to
both python and pyrex code. But it sounds a better direction in theory.

How would you like me to proceed?

« Back to merge proposal