On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:

I've trimmed things that are fine.

[debug flags]
> I hope we don't need to ask people to use them, but they are a cheap
> insurance policy.
>
> More usefully, they are helpful for benchmarking and testing. I'm ok with
> hiding them if you like.

I don't have a strong opinion. The flags are in our docs as it stands,
though, so they should be clear to people reading them rather than perhaps
causing confusion.

> > > === modified file 'bzrlib/inventory_delta.py'
> > > --- bzrlib/inventory_delta.py 2009-04-02 05:53:12 +0000
> > > +++ bzrlib/inventory_delta.py 2009-08-05 02:30:11 +0000
> > >
> > > -    def __init__(self, versioned_root, tree_references):
> > > -        """Create an InventoryDeltaSerializer.
> > > +    def __init__(self):
> > > +        """Create an InventoryDeltaSerializer."""
> > > +        self._versioned_root = None
> > > +        self._tree_references = None
> > > +        self._entry_to_content = {
> > > +            'directory': _directory_content,
> > > +            'file': _file_content,
> > > +            'symlink': _link_content,
> > > +        }
> > > +
> > > +    def require_flags(self, versioned_root=None, tree_references=None):
> > > +        """Set the versioned_root and/or tree_references flags for this
> > > +        (de)serializer.
> >
> > ^ why is this not in the constructor? You make the fields settable only
> > once, which seems identical to being set from __init__, but harder to
> > use. As it's required to be called, it should be documented in the class
> > or __init__ docstring or something like that.
>
> This is a small step towards a larger change that I don't want to tackle
> right now.
>
> We really ought to have separate classes for the serializer and for the
> deserializer. The serializer could indeed have these set at construction
> time, I think.
>
> For deserializing, we don't necessarily know, or care, what the flags
> are; e.g. a repo that supports rich-root and tree-refs can deserialise
> any delta.
> It was pretty ugly trying to cope with declaring this upfront in the
> code; hence this method, which defers it.

In principle yes, but our code always knows the source serializer, doesn't
it? Anyhow, at the moment it seems unclear and confusing, rather than
clearer. I would find it clearer either being on __init__, I think, or
perhaps with two separate constructors? No biggy, not enough to block the
patch on, but it is a very awkward halfway house at the moment - and I
wouldn't want to see it left that way - so I'm concerned that you might
switch focus away from this straight after landing it.

> > > === modified file 'bzrlib/remote.py'
> [...]
> > > +    @property
> > > +    def repository_class(self):
> > > +        self._ensure_real()
> > > +        return self._custom_format.repository_class
> > > +
> >
> > ^ this property sets off alarm bells for me. What is it for?
>
> Hmm, good question... ah, I think it's for the to_format.repository_class
> in CHKRepository._get_source:
>
>     def _get_source(self, to_format):
>         """Return a source for streaming from this repository."""
>         if (to_format.supports_chks and
>             self._format.repository_class is to_format.repository_class and
>             self._format._serializer == to_format._serializer):
>
> If I take your suggestion later on to just compare serializers this would
> be unnecessary, I think.

Yes - please remove it then ;).

> That comment was never accurate; we never did a fallback with a partial
> stream.
>
> I've re-added an accurate comment:
>
>     # Probe for the verb to use with an empty stream before sending the
>     # real stream to it. We do this both to avoid the risk of sending a
>     # large request that is then rejected, and because we don't want to
>     # implement a way to buffer, rewind, or restart the stream.

Thanks.

> > > === modified file 'bzrlib/repofmt/groupcompress_repo.py'
> [...]
> > > +        self.new_pack.finish_content()
> > > +        if len(self.packs) == 1:
> > > +            old_pack = self.packs[0]
> > > +            if old_pack.name == self.new_pack._hash.hexdigest():
> > > +                # The single old pack was already optimally packed.
> > > +                self.new_pack.abort()
> > > +                return None
> >
> > ^ This isn't quite right. It solves the case of 'bzr pack; bzr
> > pack' (and there is a bug open on that too). However it won't fix the
> > case where a single new pack is streamed, cross format, and also
> > happens to be optimally packed in the outcome. So, please file a bug on
> > this additional case.
>
> I don't understand why that case isn't covered. Perhaps you should file
> the bug instead of me?

I'd like to make sure I explain it well enough first; once you understand,
either of us can file it - and I'll be happy enough to be that person.

You've added code that says 'if we have one pack, and we're adding a pack
with the same hash while doing pack(), then we were optimally packed'.
This fixes a bug in 2a, 'bzr pack; bzr pack -> boom' (please do look that
bug up and close it in NEWS). However, because of hints there is another
case: 'if we have N packs, and we pack() with a hint of [one_pack], the
new pack we create could have the same hash as one_pack, because one_pack
happens to be optimally packed'. This will show up in exactly the same
manner as what you're hitting today, on incremental pushes. (Just pushing
two one-file revisions, with a trivial change between them, in two pushes,
should trigger this bug.)

I think this should be fixed before landing this patch, as it's pretty
tightly linked to the conversion logic for this to happen. What you need
to do is change the test from len(self.packs) == 1 to len(packs being
combined) == 1, checking that that pack has the same hash.

> > > @@ -879,14 +888,13 @@
>
> Ok, I've deleted the repository_class part of the if. Should I delete the
> “to_format.supports_chk” part too?

yes.
self is chk_supporting, so self._serializer == to_format.serializer, or
whatever, is fine.

> > > === modified file 'bzrlib/repofmt/pack_repo.py'
> > > --- bzrlib/repofmt/pack_repo.py 2009-07-01 10:42:14 +0000
> > > +++ bzrlib/repofmt/pack_repo.py 2009-07-16 02:10:20 +0000
> > >
> > > @@ -1567,7 +1574,7 @@
> > >          # determine which packs need changing
> > >          pack_operations = [[0, []]]
> > >          for pack in self.all_packs():
> > > -            if not hint or pack.name in hint:
> > > +            if hint is None or pack.name in hint:
> >
> > ^- the original form is actually faster, AFAIK, because it skips the in
> > test for a hint of []. I'd rather we didn't change it, for all that
> > it's not in a common code path.
>
> But if hint is None, then “pack.name in hint” will fail.

It would, but it won't execute. As we discussed, if you could change it
back, but add a comment - it's clearly worthy of expanding on (in either
form the necessity for testing hint isn't obvious, apparently).

> > > @@ -2093,6 +2100,7 @@
> > >              # when autopack takes no steps, the names list is still
> > >              # unsaved.
> > >              return self._save_pack_names()
> > > +        return []
> >
> > ^- is this correctness or changing the signature? If it's the latter,
> > update the docstring perhaps?
>
> Both, in a sense. A clarification, if you like... I found it useful to be
> able to distinguish “hint was passed, but empty” and “no hint passed”.
> Otherwise passing an empty hint triggered a full pack!
>
> Which docstring documents the return value of
> commit_write_group/_commit_write_group, though?
>
> I can add this to Repository.commit_write_group's docstring:
>
>     :return: it may return an opaque hint that can be passed to 'pack'.
>
> But I think it's correct that the hint should be opaque at that level. It
> seems like a detail that belongs in the scope of pack_repo.py. I guess
> InterDifferingSerializer already assumes that the hint is a list...
>
> Basically, where should I document this?

Repository.commit_write_group sounds fine.
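To make the None-versus-empty distinction concrete, here's a minimal
standalone sketch (hypothetical helper name, not actual bzrlib code) of the
hint semantics being agreed on, plus a comment on why the original
short-circuiting form is safe:

```python
def packs_to_combine(all_pack_names, hint):
    """Pick which packs a repack should touch.

    Hint semantics as discussed above: None means "no hint given,
    repack everything"; an empty list means "the write group created
    no new packs, so there is nothing to repack" - rather than
    accidentally triggering a full pack.
    """
    if hint is None:
        return list(all_pack_names)
    # In the original `not hint or name in hint` form, short-circuit
    # evaluation means the membership test is never executed when hint
    # is falsy, so `name in None` can never actually be evaluated.
    return [name for name in all_pack_names if name in hint]

print(packs_to_combine(['p1', 'p2'], None))    # every pack
print(packs_to_combine(['p1', 'p2'], []))      # nothing: avoids a full pack
print(packs_to_combine(['p1', 'p2'], ['p2']))  # only the hinted pack
```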
And yes, opaque to the repo is correct.

> > > === modified file 'bzrlib/smart/repository.py'
> > > --- bzrlib/smart/repository.py 2009-06-16 06:46:32 +0000
> > > +++ bzrlib/smart/repository.py 2009-08-04 00:51:24 +0000
> > > @@ -414,8 +418,39 @@
> > >              repository.
> > >          """
> > >          self._to_format = network_format_registry.get(to_network_name)
> > > +        if self._should_fake_unknown():
> > > +            return FailedSmartServerResponse(
> > > +                ('UnknownMethod', 'Repository.get_stream'))
> > >          return None # Signal that we want a body.
> > >
> > > +    def _should_fake_unknown(self):
> >
> > ^- the long comment here could well be helped by a docstring. I think
> > the method name could be improved too - perhaps self._support_method(),
> > or self._permit_method(). Returning UnknownMethod is a special way to
> > support it after all.
>
> Well, UnknownMethod is precisely for “this server says I do not have this
> verb”, so I'm not sure in what sense you can say returning it is a way to
> support this verb.
>
> I'm not sure how a docstring would make this clearer. Specific
> suggestions welcome.

As we discussed, I think the method should say what it does; the fact that
the base class encodes more complex policy than a child is a smell (not one
that you need to fix), but it would be very nice to have pydoc help make it
more obvious that the base class does this special handling.

> > > +class SmartServerRepositoryInsertStream_1_18(SmartServerRepositoryInsertStreamLocked):
> > > +    """Insert a record stream from a RemoteSink into a repository.
> > > +
> > > +    Same as SmartServerRepositoryInsertStreamLocked, except:
> > > +     - the lock token argument is optional
> > > +     - servers that implement this verb accept 'inventory-delta'
> > > +       records in the stream.
> > > +
> > > +    New in 1.18.
> > > + """ > > > + > > > + def do_repository_request(self, repository, resume_tokens, lock_token=None): > > > + """StreamSink.insert_stream for a remote repository.""" > > > + SmartServerRepositoryInsertStreamLocked.do_repository_request( > > > + self, repository, resume_tokens, lock_token) > > > > ^ this is really odd -- you upcall, unconditionally, and do nothing > > else. Is the reason to just make lock_token optional? If so, why not > > just change the signature on > > SmartServerRepositoryInsertStreamLocked.do_repository_request? Secondly, > > the base class will accept deltas too, AFAICT. So really,its that > > calling the new registered verb is a way to be sure that deltas will be > > accepted. I'd be happier with just: > > register SSRISL twice in the registry, and document the > > improvements/easements. Note that we will/should be locking always (at > > the api layer, though not necessarily the network) for compatibility > > with knits - its why we have SSRISLocked anyway. So the change to make > > lock_token optional seems unneeded. > > I want to avoid incidental changes to old verbs to minimise the risk of > accidentally introducing cross-version imcompatibilities. It would be easy for > someone to glance at the definition of a verb in 1.18 and see that it can take > an optional arg, and then assume that therefore it always took that arg, and > write client code accordingly. > > I needed a new verb registration, obviously, and I took the opportunity to make > the locking optional, which with hindsight is how the interface should have been > in the first place. (We didn't add the _locked variant until a release after > the original, when we realised we needed that capability and our original verb > didn't have it.) > > So, I'm comfortable that this is > a) the right interface (just one verb), and > b) being appropriately safe about backwards compatibility (if erring slightly > on the side of paranoia). 
>
> To be really paranoid, I *am* actually tempted to explicitly guard
> against inventory-deltas in the old verb, although I'm not sure that
> actually buys us much for the effort involved.

So, if the lock_token was required, there would be:
 - just one verb
 - no backwards compatibility issues

Up to you; I do think less code is better, all things considered.

> I considered leaving it there simply for test selection, but that's not
> the right way to do that, because it can be (and was!) out of sync with
> what the test was actually doing.

You proposed putting class names in literally; that would be fine by me.
It's no less likely to skew, but I think the issue goes deeper than 'it
can skew' - we have many potentially expensive tests being executed in the
same situation, AIUI. So we may well want to revisit this.

> > > +        if not unstacked_repo._format.supports_chks:
> > > +            # these assertions aren't valid for groupcompress repos, which may
> > > +            # transfer data than strictly necessary to avoid breaking up an
> > > +            # already-compressed block of data.
> > > +            self.assertFalse(unstacked_repo.has_revision('left'))
> > > +            self.assertFalse(unstacked_repo.has_revision('right'))
> >
> > ^ please check the comment, it's not quite clear.
>
> s/transfer data/transfer more data/
>
> I'm not 100% certain that's actually true? What do you think with the
> adjusted comment?

Big alarm bell. gc repos aren't permitted to do that for the revision
objects, because of our invariants about revisions. Something else is
wrong - that if _must not_ be needed.
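To spell out the invariant being invoked here, a self-contained sketch
(stub repository class and hypothetical helper names, not the bzrlib API):
a fetch may over-read compressed *text* data, but unrequested *revision*
objects must never appear in the target, regardless of format - so the
check should hold without any supports_chks guard.

```python
class StubRepo:
    """Minimal stand-in for a repository's has_revision() query."""
    def __init__(self, revision_ids):
        self._revision_ids = set(revision_ids)

    def has_revision(self, revision_id):
        return revision_id in self._revision_ids


def check_revision_invariant(repo, requested, unrequested):
    """Every requested revision is present; no unrequested one is.

    This is the condition that should hold unconditionally, because
    revision objects anchor repository integrity even when a
    groupcompress fetch copies extra text data.
    """
    return (all(repo.has_revision(r) for r in requested) and
            not any(repo.has_revision(r) for r in unrequested))


# After fetching only 'tip', neither 'left' nor 'right' may appear.
repo = StubRepo(['tip'])
print(check_revision_invariant(repo, ['tip'], ['left', 'right']))  # True
```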
> > >
> > > +    def assertCanStreamRevision(self, repo, revision_id):
> > > +        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
> > > +        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
> > > +        source = repo._get_source(repo._format)
> > > +        for substream_kind, substream in source.get_stream(search):
> > > +            # Consume the substream
> > > +            list(substream)
> >
> > ^ - this assertion, while nicely named, isn't strong enough for me. I'd
> > like it to assert that either a delta matching the local changes, or a
> > full inventory, is streamed - and that the newly referenced file texts
> > are present.
>
> Hmm, fair enough. That may happen after landing, though, but not long
> after.

If this is opening a hole, I'd really rather it be done before landing;
especially if you agree it's worth doing. If it's an existing hole that
refactoring made obvious, then leaving it till later is fine with me.

> > > === modified file 'bzrlib/tests/test_remote.py'
> > > --- bzrlib/tests/test_remote.py 2009-07-30 04:27:05 +0000
> > > +++ bzrlib/tests/test_remote.py 2009-08-05 01:41:03 +0000
> > >
> > > +    """Tests for using the Repository.insert_stream verb when the
> > > +    _1.18 variant is not available.
> > > +
> > > +    This test case is very similar to TestRepositoryInsertStream_1_18.
> > > +    """
> >
> > Perhaps a base class would make this smaller - it is nearly identical.
>
> I don't see any elegant ways to cut the duplication at this point. I
> haven't tried very hard though, because it's just small enough that it
> fits under my threshold for tolerating the ugly...

It went past mine :). TemplateMethod perhaps?

> > I'm not sure how to say 'needs fixing' in review mail, but that's what
> > I'd say.
> >
> > It's close though.
>
> Thanks very much for the extensive review.

My pleasure. It will be great to see this land.

-Rob