Bazaar

Code review comment for lp:~spiv/bzr/inventory-delta

inventory-delta
Merge into trunk-old

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-06:

+* InterDifferingSerializer has been removed. The transformations it
+ provided are now done automatically by StreamSource. (Andrew Bennetts)
+

^- So for starters this isn't true for now at least.

You added the debug flags "-DIDS:never" and "-DIDS:always". I wonder if they wouldn't be better as IDS_never and IDS_always, just because that would seem to fit better with how we have generally named them. Not a big deal.

+
+
+def _new_root_data_stream(
+ root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):

^- I find the wrapping to be a bit strange, since this and the next function don't have any parameters on the first line. More importantly, though, these should probably at least have a minimal docstring to help understand what is going on.

It is nice that you were able to factor out these helpers so that we can consistently generate the root keys and their parents. It is a shame that _new_root_data_stream is being called on:
rev_id_to_root_id = self._find_root_ids(revs, parent_map, graph)

Which is implemented by iterating all revision trees, which requires parsing all of the revisions. Though it then immediately goes on to use those trees to generate the inventory deltas.

    def _find_root_ids(self, revs, parent_map, graph):
        revision_root = {}
        for tree in self.iter_rev_trees(revs):
            revision_id = tree.inventory.root.revision
            root_id = tree.get_root_id()
            revision_root[revision_id] = root_id
        # Find out which parents we don't already know root ids for
        parents = set()
        for revision_parents in parent_map.itervalues():
            parents.update(revision_parents)
        parents.difference_update(revision_root.keys() + [NULL_REVISION])

^- We've certainly had this pattern enough that we really should consider factoring it out into a helper. But it seems the preferred form is:

parents = set()
map(parents.update, parent_map.itervalues())
parents.difference_update(revision_root)
parents.discard(NULL_REVISION)

That should perform better than what we have. (No need to generate a list from keys() nor append to it NULL_REVISION, creating potentially a 3rd all-keys list.)

        # Limit to revisions present in the versionedfile
        parents = graph.get_parent_map(parents).keys()
        for tree in self.iter_rev_trees(parents):
            root_id = tree.get_root_id()
            revision_root[tree.get_revision_id()] = root_id
        return revision_root

I've been doing some more performance testing, and I've basically seen stuff
like:

bzr branch mysql-525 local -DIDS:always
3m15s

bzr branch mysql-525 local -DIDS:always --XML8-nocopy
2m30s

bzr branch mysql-525 local -DIDS:never
4m01s

bzr branch mysql-525 bzr://localhost/remote -DIDS:never
3m25s

So it is slower, but only about 25% slower (except for the extra-optimized
XML8-nocopy).

I think the new code converts almost as fast, and the extra repacking from IDS
actually costs it a bit of time (other than saving it disk space). The only
major deficit at this point is that there is no progress indication. So I'd
probably stick with IDS for local operations just because of that.

I'll try to finish reviewing the rest, but at least my initial performance
concerns have been addressed.

+* InterDifferingSerializer has been removed.  The transformations it
+  provided are now done automatically by StreamSource.  (Andrew Bennetts)
+