Code review comment for lp:~ian-clatworthy/bzr/faster-diff2

John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> iter-changes is *really* fast while get-entry on large trees isn't
>> (because it bisects every time). So maybe iter_changes ought to keep a
>> cache of changed entries it finds and get_entry could check that first?
>> If I'm right, the combination ought to scream along, speeding up diff
>> and perhaps several other operations too. The downside is that I hate
>> changing dirstate :-) and that the change might require modifications to
>> both python and pyrex code. But it sounds a better direction in theory.
>>
>> How would you like me to proceed?
>
> I believe the point of iter_changes was that it'd give you back all
> you need, and we'd prefer not to do more round trips afterwards.
> Ideally it would be a bit abstracted away from the details of dirstate
> representation, but still tuples that are fast to construct. If there
> are some items that would be slow to compute and not always needed you
> could possibly pass back a (curried?) callable.
>
> What does diff need to get from the tree that's not returned by
> iter_changes? The text and the mtime? Really neither of them ought
> to need to look at the dirstate, if they have the file id and path, as
> far as I can see. The mtime just comes from either the working copy
> or the revision, and the text too can be determined by either the path
> or the file-id.
>

For a while now I've been of the mind that 'iter_changes' should be
returning real objects, rather than tuples. IMO the *important* bit is
that you need to avoid doing O(tree) objects for a primarily O(changes)
operation.

iter_changes() can be "O(tree)" but only when...
1) include_unchanged=True
2) initial commit

Under both cases, the action being *performed* is O(tree), so
iter_changes isn't creating extra overhead.

If I was to personally give a direction for this, it would be to create
a new iter_changes-like operation, which returns real objects with
attributes back to the calling code. (An initial skin could just wrap
iter_changes() into real objects.)
And along with those objects, you could then add "get me more
information", which could include stuff like caching in the dirblock
entry, to quickly return extra info.

If you feel the overhead is in 'bisect' time, then we should look at
where that bisect is being done. I would much rather have a 'location
hint' cache at the dirstate level, which would seed the bisect search.
So when iter_changes finds something interesting about ('foo', 'bar',
'baz') it can then set the cache hint, so that a lookup of ('foo',
'bar') knows right where to find the object. I would also design it so
that it grabs that entry, checks if it is correct. If not, it
invalidates the entry (possibly the whole cache?) and then does the
normal bisect.

This would also have implications for multiple '_get_entry' calls, since
if you just grabbed it for X.rev, it should be extremely fast to grab it
for X.mtime.

Note, I would use something like an LRU cache or even FIFO cache here. I
guess you did somethin along the same lines with the
'_last_entry[_lookup]' functions, though I think I might put it under
the _get_entry() call, rather than above it.

As for code that does "(file_id, None)" versus "(file_id, path)"...

file_id lookups are rather expensive. We create a mapping on-demand from
file-id => path, but that first time it requires walking the entire state.
I'll also note that your choice of lookup key precludes supporting that
anyway. You have a single entry, and it is *either* file_id or path.

I'm also 95% sure that file_id is *never* None. Which means you are
*always* caching based on file_id, and never based on path. Which is ok,
but if we are going to do that, just do it directly, and don't try to
get tricky with "path:%s" tricks.

So....

1) I'd rather the caching was done *by* dirstate._get_entry
2) I think we can use an LRU cache with a low number, but more than 1
3) I'd like to see the cache become a 'hint', and have the _get_entry
code validate that the object hinted really is the object it thinks it
is. And I'd like it to do that by storing the location, and not the raw
object. (so cache[key] = (dirblock_offset, sub_offset), rather than
cache[key] = direntry,
and then lookup becomes:
  try:
    db_offset, sub_offset = cache[key]
    entry = self._dirblocks[db_offset][sub_offset]
  except (KeyError, IndexError), e:
    pass # fallback
  else:
    if entry[0] == (file_id, ...):
      return entry
    # the entry didn't match, fallback to the original check
  bisect()...

Double check if 'file_id' is always given in these cases, but I'm 95%
sure that it is. So we probably need to make the key:
(file_id, parent_index)

Either that, or we could make the key just file_id, and allow for
multiple entries because the file was renamed between parents, etc.

4) If you feel really ambitious, we could work on defining a better
iter_changes() api...

Does this seem distressingly difficult to do?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoeY4kACgkQJdeBCYSNAANcpwCeKDgCaLF3vVKFNd9VQqipkbNm
IWoAoKqKZ/qXb4eT+uLBZVax+qjQDegs
=Nki4
-----END PGP SIGNATURE-----

« Back to merge proposal