Bazaar

Code review comment for lp:~spiv/bzr/fetch-all-tags-309682

fetch-all-tags-309682
Merge into bzr.dev

Revision history for this message

John A Meinel (jameinel) wrote on 2010-12-14:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> I think much of the mess stems from the weirdness of sprouting a bzrdir rather than a repository or branch directly. I'm starting to think we should remove ControlDir.sprout.
>

I'm in favor of this. It will mess up our test suite a ton, because that
has become the defacto method of creating a new branch. However using:

tree.branch.sprout(new_branch) makes a lot more sense to me than
tree.bzrdir.sprout(new_something)

I think some of it stems from, do you want a WT in the target or not,
etc, etc. I would be happier with something that you can compose. This
is also cluttered because you can't just open the WT from a Branch, it
will re-open the branch, etc. (Note that there was also bugs with
branch.sprout() which didn't happen with bzrdir.sprout(), something with
Trees and merge, and double-locking the WT, IIRC)

Things we want to keep in mind:

a) In the ideal interface, we would have a way to copy any number of
   branches simultaneously into a target. This gives us a way to
   implement some form of Repository cloning that people would like.
   (Being able to grab the stack of a loom, or trunk+main features,
   etc.)
b) I think we do like the ability to branch from a source, and preserve
   the type. However, many extensions probably explicitly work around
   this (bzr branch svn:// is not creating an svn branch, same for
   hg/git/etc)

On 12/6/2010 11:28 PM, Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/fetch-all-tags-309682 into lp:bzr with lp:~spiv/bzr/fetch-integration as a prerequisite.
>
> Requested reviews:
> bzr-core (bzr-core)
> Related bugs:
> #309682 tags are copied but their revisions may not be
> https://bugs.launchpad.net/bugs/309682
>
>
> This branch fixes <https://bugs.launchpad.net/bzr/+bug/309682>: bzr branch should copy revisions referred to in tags, not just the tip.
>
> The main difficulty is the ugliness of sprout and to a lesser extent fetch. We don't want fetching the tags to regress our performance: i.e. no extra roundtrips for redundant locking/unlocking/locking again, no multiple calls to fetch for one sprout, no errors from bzr branch simply because a tag refers to a ghost, etc. At the same time we don't want to make the code any more complex and harder to understand than it already is!
>
> The prerequisite branch addresses much of that, so that this patch on its own is fairly small. (The prerequisite branch is simply lp:~spiv/bzr/sprout-does-not-reopen-repo and lp:~spiv/bzr/fetch-spec-everything-not-in-other merged together. Both of those are in the review queue.)
>
> I've taken the approach of moving much of the complexity out of the sprout method, into more focussed methods and objects which are hopefully easy to understand isolation.
>
> Part of the problem is that the interface of sprout is pretty confusing, as demonstrated by the one test failure I've currently got in this branch: bzrlib.tests.per_controldir.test_controldir.TestControlDir.test_sprout_bzrdir_repository_branch_only_source_under_shared
>
> That test fails because it verifies that bzrdir.sprout(url) from a bzrdir with a branch in a shared repo copies the entire repository. I think this behaviour is wrong, because it's likely to surprise callers. A quick check on IRC suggests that lifeless didn't expect this behaviour either. And quick skim of the bzrlib source suggests that at least 'bzr switch --create-branch' is encountering this behaviour without intending it. So I intend to remove or adjust the test for the new behaviour, but I thought I should get some reviewer feedback first.
>
> More generally, the interaction between the various combinations of passing or not passing revision_id and source_branch and various configurations of source branches and repositories is too complex. It's not a friendly interface. Unfortunately adding consideration of tags into the mix is probably only making it messier. FWIW the behaviour now is:
>
> * If a source branch is passed or found, only its tip and tags are fetched
> * if revision_id is passed, then it overrides the tip
> * otherwise if a revision_id is passed, then that is fetched
> * otherwise the whole repository is fetched.
>
> Finally, while this addresses the case reported in bug 309682 (bzr branch), I think a complete fix requires that pull, merge, update, etc also fetch tags. I have a follow-up branch for that.

+class FetchSpecFactory(object):
+ """A helper for building the best fetch spec for a sprout call.
+
+ Factors that go into determining the sort of fetch to perform:
+ * did the caller specify any revision IDs?
+ * did the caller specify a source branch (need to fetch the tip +
tags)
+ * is there an existing target repo (don't need to refetch revs it
+ already has)
+ * target is stacked? (similar to pre-existing target repo: even if
+ the target itself is new don't want to refetch existing revs)
+
+ :ivar source_branch: the source branch if one specified, else None.
+ :ivar source_branch_stop_revision: fetch up to this revision of
+ source_branch, rather than its tip.
+ :ivar source_repo: the source repository if one found, else None.
+ :ivar target_repo: the target repository acquired by sprout.
+ :ivar target_repo_kind: one of the _TargetRepoKinds constants.
+ """

^- I really prefer calling them "source_branch_stop_revision_id". or
"source_tip_revision_id". I'd like the id to be in there, to make it
clear that this is a string rev_id, not a tuple key, not a Revision
object, etc.

However, shouldn't it be "source_tip_revision_ids" (a list of strings).
To allow more flexibility? We've often been hampered by having a single
revision_id to work with.

I see that you have "explicit_rev_ids = set()" which isn't documented.
Is there a reason to keep "source_branch_stop_revision" at all?

...

+ try:
+ tags_to_fetch.update(
+ self.source_branch.tags.get_reverse_tag_dict())

I'm worried about the performance of this, especially in the case of
stuff like Emacs. IIRC the conversion from CVS brought something like
2000 tags.

While it probably isn't critical for "bzr branch", if "bzr pull"
suddenly has to download and query against 2000 tips, that could get
expensive.

(Note that if we had versioned tags, then you could find the diff of the
tag definition, and then only fetch those tags. As a fairly reasonable
way to get incremental fetch only thinking about incremental changes.)

+ fetch_spec_factory = FetchSpecFactory()
+ if revision_id is not None:
+ fetch_spec_factory.add_revision_ids([revision_id])
+ fetch_spec_factory.source_branch_stop_revision = revision_id

^- This structure is a bit confusing. I'd rather see:

fetch_spec_factory.set_tip_revision(revision_id)

And then have that add_revision_ids() && set the member.

Otherwise it is just a hard-to-get-right api. But see my earlier
discussion about wanting the ability to fetch multiple branches. So maybe:
fetch_spec_factory.add_head() or add_tip_revision_id() or ...

+ heads_to_fetch = set(self.explicit_rev_ids)
+ tags_to_fetch = set()
+ if self.source_branch is not None:
+ try:
+ tags_to_fetch.update(
+ self.source_branch.tags.get_reverse_tag_dict())
+ except errors.TagsNotSupported:
+ pass
+ if self.source_branch_stop_revision is not None:
+ heads_to_fetch.add(self.source_branch_stop_revision)
+ else:
+ heads_to_fetch.add(self.source_branch.last_revision())

^- Note that adding self.source_branch_stop_revision() is redundant
here, with how you called add_revision_ids() and set the pointer.

...

+ for path, file_id in subtrees:
+ target = urlutils.join(url, urlutils.escape(path))
+ sublocation = source_branch.reference_parent(file_id, path)
+ sublocation.bzrdir.sprout(target,
+ basis.get_reference_revision(file_id, path),
+ force_new_repo=force_new_repo, recurse=recurse,
+ stacked=stacked)

^- This is also a hint for a case where we would probably like to do one
large fetch across all the subtrees. Though it is a bit muddled because
Aaron didn't want to enforce that subtrees must share a repository.

The tests seem ok, I wasn't very thorough.

The idea that sprout will use branch.last_revision() if none is supplied
is ok. Though having a way to explicitly request all revisions for a
repository, even if it has a branch, might also be useful...

review: needs_fixing

The big thing is to clarify some of the new fetch spec interface, the
rest is all stuff we can discuss to figure out what will be the nicest
way forward.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0H9WQACgkQJdeBCYSNAANbiACcDHWN+22LzAXoXBo3SU2fImYl
kd0AoKdlM5xlqiLtbDjp6TwXoblnB+ab
=CpfO
-----END PGP SIGNATURE-----

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> I think much of the mess stems from the weirdness of sprouting a bzrdir rather than a repository or branch directly.  I'm starting to think we should remove ControlDir.sprout.
>

I'm in favor of this. It will mess up our test suite a ton, because that
has become the defacto method of creating a new branch. However using:

tree.branch.sprout(new_branch) makes a lot more sense to me than
tree.bzrdir.sprout(new_something)

Things we want to keep in mind:

On 12/6/2010 11:28 PM, Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/fetch-all-tags-309682 into lp:bzr with lp:~spiv/bzr/fetch-integration as a prerequisite.
> 
> Requested reviews:
>   bzr-core (bzr-core)
> Related bugs:
>   #309682 tags are copied but their revisions may not be
>   https://bugs.launchpad.net/bugs/309682
> 
> 
> This branch fixes <https://bugs.launchpad.net/bzr/+bug/309682>: bzr branch should copy revisions referred to in tags, not just the tip.
> 
> The main difficulty is the ugliness of sprout and to a lesser extent fetch.  We don't want fetching the tags to regress our performance: i.e. no extra roundtrips for redundant locking/unlocking/locking again, no multiple calls to fetch for one sprout, no errors from bzr branch simply because a tag refers to a ghost, etc.  At the same time we don't want to make the code any more complex and harder to understand than it already is!
> 
> The prerequisite branch addresses much of that, so that this patch on its own is fairly small.  (The prerequisite branch is simply lp:~spiv/bzr/sprout-does-not-reopen-repo and lp:~spiv/bzr/fetch-spec-everything-not-in-other merged together.  Both of those are in the review queue.)
> 
> I've taken the approach of moving much of the complexity out of the sprout method, into more focussed methods and objects which are hopefully easy to understand isolation.
> 
> Part of the problem is that the interface of sprout is pretty confusing, as demonstrated by the one test failure I've currently got in this branch: bzrlib.tests.per_controldir.test_controldir.TestControlDir.test_sprout_bzrdir_repository_branch_only_source_under_shared
> 
> That test fails because it verifies that bzrdir.sprout(url) from a bzrdir with a branch in a shared repo copies the entire repository.  I think this behaviour is wrong, because it's likely to surprise callers.  A quick check on IRC suggests that lifeless didn't expect this behaviour either.  And quick skim of the bzrlib source suggests that at least 'bzr switch --create-branch' is encountering this behaviour without intending it.  So I intend to remove or adjust the test for the new behaviour, but I thought I should get some reviewer feedback first.
> 
> More generally, the interaction between the various combinations of passing or not passing revision_id and source_branch and various configurations of source branches and repositories is too complex.  It's not a friendly interface.  Unfortunately adding consideration of tags into the mix is probably only making it messier.  FWIW the behaviour now is:
> 
>  * If a source branch is passed or found, only its tip and tags are fetched
>    * if revision_id is passed, then it overrides the tip
>  * otherwise if a revision_id is passed, then that is fetched
>  * otherwise the whole repository is fetched.
> 
> Finally, while this addresses the case reported in bug 309682 (bzr branch), I think a complete fix requires that pull, merge, update, etc also fetch tags.  I have a follow-up branch for that.

+class FetchSpecFactory(object):
+    """A helper for building the best fetch spec for a sprout call.
+
+    Factors that go into determining the sort of fetch to perform:
+     * did the caller specify any revision IDs?
+     * did the caller specify a source branch (need to fetch the tip +
tags)
+     * is there an existing target repo (don't need to refetch revs it
+       already has)
+     * target is stacked?  (similar to pre-existing target repo: even if
+       the target itself is new don't want to refetch existing revs)
+
+    :ivar source_branch: the source branch if one specified, else None.
+    :ivar source_branch_stop_revision: fetch up to this revision of
+        source_branch, rather than its tip.
+    :ivar source_repo: the source repository if one found, else None.
+    :ivar target_repo: the target repository acquired by sprout.
+    :ivar target_repo_kind: one of the _TargetRepoKinds constants.
+    """

However, shouldn't it be "source_tip_revision_ids" (a list of strings).
To allow more flexibility? We've often been hampered by having a single
revision_id to work with.

I see that you have "explicit_rev_ids = set()" which isn't documented.
Is there a reason to keep "source_branch_stop_revision" at all?

...

+            try:
+                tags_to_fetch.update(
+                    self.source_branch.tags.get_reverse_tag_dict())

I'm worried about the performance of this, especially in the case of
stuff like Emacs. IIRC the conversion from CVS brought something like
2000 tags.

While it probably isn't critical for "bzr branch", if "bzr pull"
suddenly has to download and query against 2000 tips, that could get
expensive.

+        fetch_spec_factory = FetchSpecFactory()
+        if revision_id is not None:
+            fetch_spec_factory.add_revision_ids([revision_id])
+            fetch_spec_factory.source_branch_stop_revision = revision_id

^- This structure is a bit confusing. I'd rather see:

fetch_spec_factory.set_tip_revision(revision_id)

And then have that add_revision_ids() && set the member.

Otherwise it is just a hard-to-get-right api. But see my earlier
discussion about wanting the ability to fetch multiple branches. So maybe:
  fetch_spec_factory.add_head() or add_tip_revision_id() or ...

+        heads_to_fetch = set(self.explicit_rev_ids)
+        tags_to_fetch = set()
+        if self.source_branch is not None:
+            try:
+                tags_to_fetch.update(
+                    self.source_branch.tags.get_reverse_tag_dict())
+            except errors.TagsNotSupported:
+                pass
+            if self.source_branch_stop_revision is not None:
+                heads_to_fetch.add(self.source_branch_stop_revision)
+            else:
+                heads_to_fetch.add(self.source_branch.last_revision())

^- Note that adding self.source_branch_stop_revision() is redundant
here, with how you called add_revision_ids() and set the pointer.

...

+            for path, file_id in subtrees:
+                target = urlutils.join(url, urlutils.escape(path))
+                sublocation = source_branch.reference_parent(file_id, path)
+                sublocation.bzrdir.sprout(target,
+                    basis.get_reference_revision(file_id, path),
+                    force_new_repo=force_new_repo, recurse=recurse,
+                    stacked=stacked)

The tests seem ok, I wasn't very thorough.

review: needs_fixing

The big thing is to clarify some of the new fetch spec interface, the
rest is all stuff we can discuss to figure out what will be the nicest
way forward.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0H9WQACgkQJdeBCYSNAANbiACcDHWN+22LzAXoXBo3SU2fImYl
kd0AoKdlM5xlqiLtbDjp6TwXoblnB+ab
=CpfO
-----END PGP SIGNATURE-----

review: Needs Fixing

« Back to merge proposal