Robert Collins wrote:
> On Wed, 2009-08-05 at 02:45 +0000, Andrew Bennetts wrote:
> > === modified file 'NEWS'
> >
> > +* InterDifferingSerializer has been removed.  The transformations it
> > +  provided are now done automatically by StreamSource.  (Andrew Bennetts)
>
> ^ has it?

Oops.  Updated:

 * InterDifferingSerializer is now only used locally.  Other fetches that
   would have used InterDifferingSerializer now use the more network
   friendly StreamSource, which now automatically does the same
   transformations as InterDifferingSerializer.  (Andrew Bennetts)

> > === modified file 'bzrlib/fetch.py'
> > --- bzrlib/fetch.py 2009-07-09 08:59:51 +0000
> > +++ bzrlib/fetch.py 2009-07-29 07:08:54 +0000
> > @@ -249,20 +251,77 @@
> >          # yet, and are unlikely to in non-rich-root environments anyway.
> >          root_id_order.sort(key=operator.itemgetter(0))
> >          # Create a record stream containing the roots to create.
> > -        def yield_roots():
> > -            for key in root_id_order:
> > -                root_id, rev_id = key
> > -                rev_parents = parent_map[rev_id]
> > -                # We drop revision parents with different file-ids, because
> > -                # that represents a rename of the root to a different location
> > -                # - its not actually a parent for us. (We could look for that
> > -                # file id in the revision tree at considerably more expense,
> > -                # but for now this is sufficient (and reconcile will catch and
> > -                # correct this anyway).
> > -                # When a parent revision is a ghost, we guess that its root id
> > -                # was unchanged (rather than trimming it from the parent list).
> > -                parent_keys = tuple((root_id, parent) for parent in rev_parents
> > -                    if parent != NULL_REVISION and
> > -                        rev_id_to_root_id.get(parent, root_id) == root_id)
> > -                yield FulltextContentFactory(key, parent_keys, None, '')
> > -        return [('texts', yield_roots())]
> > +        from bzrlib.graph import FrozenHeadsCache
> > +        graph = FrozenHeadsCache(graph)
> > +        new_roots_stream = _new_root_data_stream(
> > +            root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
> > +        return [('texts', new_roots_stream)]
>
> These functions have a lot of parameters.  Perhaps that would work better
> as state on the Source?

Perhaps, although InterDifferingSerializer uses it too (that's where I
originally extracted this code from).  Something to look at again when we
remove IDS, I think.

> They need docs/more docs respectively.

The methods they were refactored from didn't have docs either :P

Docs extended.
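For anyone following along, the extracted helper has roughly this shape (a
simplified sketch based on the yield_roots code above, not the exact patch
text; the real _new_root_data_stream also uses the repo and the frozen
heads graph when filtering parents, which I've elided here):

    from bzrlib.revision import NULL_REVISION
    from bzrlib.versionedfile import FulltextContentFactory

    def _new_root_data_stream(root_keys_to_create, rev_id_to_root_id,
                              parent_map, repo, graph=None):
        """Generate a 'texts' substream of synthesised root entries.

        :param root_keys_to_create: (root_id, rev_id) pairs, in
            topological order, naming the root entries to synthesise.
        """
        for key in root_keys_to_create:
            root_id, rev_id = key
            rev_parents = parent_map[rev_id]
            # Drop parents whose tree has a different root file-id (that
            # is a root rename, not a parent of this root text), and
            # assume ghost parents left the root id unchanged.
            parent_keys = tuple(
                (root_id, parent) for parent in rev_parents
                if parent != NULL_REVISION and
                    rev_id_to_root_id.get(parent, root_id) == root_id)
            yield FulltextContentFactory(key, parent_keys, None, '')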
> > === modified file 'bzrlib/help_topics/en/debug-flags.txt'
> > --- bzrlib/help_topics/en/debug-flags.txt 2009-07-24 03:15:56 +0000
> > +++ bzrlib/help_topics/en/debug-flags.txt 2009-08-05 02:05:43 +0000
> > @@ -12,6 +12,7 @@
> >                     operations.
> >  -Dfetch           Trace history copying between repositories.
> >  -Dfilters         Emit information for debugging content filtering.
> > +-Dforceinvdeltas  Force use of inventory deltas during generic streaming fetch.
> >  -Dgraph           Trace graph traversal.
> >  -Dhashcache       Log every time a working file is read to determine its hash.
> >  -Dhooks           Trace hook execution.
> > @@ -26,3 +27,7 @@
> >  -Dunlock          Some errors during unlock are treated as warnings.
> >  -Dpack            Emit information about pack operations.
> >  -Dsftp            Trace SFTP internals.
> > +-Dstream          Trace fetch streams.
> > +-DIDS:never       Never use InterDifferingSerializer when fetching.
> > +-DIDS:always      Always use InterDifferingSerializer to fetch if appropriate
> > +                  for the format, even for non-local fetches.
>
> I'm a bit uncomfortable with all these debug flags.  Are they really
> needed?  Do we anticipate asking people to use them?  If not perhaps we
> shouldn't list them, or should use environment variables.

I hope we don't need to ask people to use them, but they are a cheap
insurance policy.  More usefully, they are helpful for benchmarking and
testing.  I'm ok with hiding them if you like.

> > === modified file 'bzrlib/inventory_delta.py'
> > --- bzrlib/inventory_delta.py 2009-04-02 05:53:12 +0000
> > +++ bzrlib/inventory_delta.py 2009-08-05 02:30:11 +0000
> >
> > -    def __init__(self, versioned_root, tree_references):
> > -        """Create an InventoryDeltaSerializer.
> > +    def __init__(self):
> > +        """Create an InventoryDeltaSerializer."""
> > +        self._versioned_root = None
> > +        self._tree_references = None
> > +        self._entry_to_content = {
> > +            'directory': _directory_content,
> > +            'file': _file_content,
> > +            'symlink': _link_content,
> > +        }
> > +
> > +    def require_flags(self, versioned_root=None, tree_references=None):
> > +        """Set the versioned_root and/or tree_references flags for this
> > +        (de)serializer.
>
> ^ why is this not in the constructor?  You make the fields settable only
> once, which seems identical to being set from __init__, but harder to
> use.  As its required to be called, it should be documented in the class
> or __init__ docstring or something like that.

This is a small step towards a larger change that I don't want to tackle
right now.  We really ought to have separate classes for the serializer
and for the deserializer.  The serializer could indeed have these set at
construction time, I think.

For deserializing, we don't necessarily know, or care, what the flags are;
e.g. a repo that supports rich-root and tree-refs can deserialise any
delta.  It was pretty ugly trying to cope with declaring this upfront in
the code; hence this method, which defers it (there's a rough sketch of
what I mean below).

> > @@ -221,10 +255,18 @@
> >      def parse_text_bytes(self, bytes):
> >          """Parse the text bytes of a serialized inventory delta.
> >
> > +        If versioned_root and/or tree_references flags were set via
> > +        require_flags, then the parsed flags must match or a BzrError will be
> > +        raised.
> > +
> ...
>
> the changes to this function look like they do what they are intended to
> do, but I don't understand why thats desirable or necessary: we always
> know whether we want rich/nonrich etc.

But sometimes we want both (or rather, we don't have a preference).
There's no good reason to reject a "tree_references: false" delta sent to
a repo that supports them.
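To make the set-once semantics concrete, require_flags behaves roughly
like this (a simplified sketch, not the exact patch text; the
AssertionError is my shorthand for whichever error the real code raises):

    def require_flags(self, versioned_root=None, tree_references=None):
        """Declare flag values this (de)serializer must see, if known.

        Flags left as None stay unconstrained, so a deserializer can
        accept deltas with either value.  Setting a flag twice to
        conflicting values is an error.
        """
        if (versioned_root is not None and
                self._versioned_root is not None and
                versioned_root != self._versioned_root):
            raise AssertionError("versioned_root flag already set")
        if (tree_references is not None and
                self._tree_references is not None and
                tree_references != self._tree_references):
            raise AssertionError("tree_references flag already set")
        if versioned_root is not None:
            self._versioned_root = versioned_root
        if tree_references is not None:
            self._tree_references = tree_references

parse_text_bytes then compares the flags found in the delta header against
any non-None values set here, and raises a BzrError on a mismatch.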
> > === modified file 'bzrlib/remote.py'
[...]
> > +    @property
> > +    def repository_class(self):
> > +        self._ensure_real()
> > +        return self._custom_format.repository_class
> > +
>
> ^ this property sets off alarm bells for me.  What its for?

Hmm, good question... ah, I think it's for the to_format.repository_class
in CHKRepository._get_source:

    def _get_source(self, to_format):
        """Return a source for streaming from this repository."""
        if (to_format.supports_chks and
            self._format.repository_class is to_format.repository_class and
            self._format._serializer == to_format._serializer):

If I take your suggestion later on to just compare serializers this would
be unnecessary I think.

> >  class RemoteStreamSink(repository.StreamSink):
> >
> > +    def __init__(self, target_repo):
> > +        repository.StreamSink.__init__(self, target_repo)
> > +
>
> ^ this isn't needed, as its doing nothing more than upcall.

Oops, thanks.  Debugging detritus.  Removed.

> >          path = target.bzrdir._path_for_remote_call(client)
> > -        if not resume_tokens:
> > -            # XXX: Ugly but important for correctness, *will* be fixed during
> > -            # 1.13 cycle. Pushing a stream that is interrupted results in a
> > -            # fallback to the _real_repositories sink *with a partial stream*.
> > -            # Thats bad because we insert less data than bzr expected. To avoid
> > -            # this we do a trial push to make sure the verb is accessible, and
> > -            # do not fallback when actually pushing the stream. A cleanup patch
> > -            # is going to look at rewinding/restarting the stream/partial
> > -            # buffering etc.
>
> ^-- this isn't fixed, is it?

That comment was never accurate; we never did a fallback with a partial
stream.  I've re-added an accurate comment:

        # Probe for the verb to use with an empty stream before sending the
        # real stream to it.  We do this both to avoid the risk of sending a
        # large request that is then rejected, and because we don't want to
        # implement a way to buffer, rewind, or restart the stream.

> >  class RemoteStreamSource(repository.StreamSource):
> >      """Stream data from a remote server."""
> > @@ -1745,6 +1815,12 @@
> >              sources.append(repo)
> >          return self.missing_parents_chain(search, sources)
> >
> > +    def get_stream_for_missing_keys(self, missing_keys):
> > +        self.from_repository._ensure_real()
> > +        real_repo = self.from_repository._real_repository
> > +        real_source = real_repo._get_source(self.to_format)
> > +        return real_source.get_stream_for_missing_keys(missing_keys)
>
> ^ this fixes bug 406686.  I don't think you mention that though in NEWS
> etc.

I don't think that bug report was filed when I wrote that :P

Added NEWS entry.

> > === modified file 'bzrlib/repofmt/groupcompress_repo.py'
[...]
> > +        self.new_pack.finish_content()
> > +        if len(self.packs) == 1:
> > +            old_pack = self.packs[0]
> > +            if old_pack.name == self.new_pack._hash.hexdigest():
> > +                # The single old pack was already optimally packed.
> > +                self.new_pack.abort()
> > +                return None
>
> ^ This isn't quite right.  It solves the case of 'bzr pack; bzr
> pack' (and there is a bug open on that too).  However it won't fix the
> case where a single new pack is streamed, cross format, and also happens
> to be optimally packed in the outcome.  So, please file a bug on this
> additional case.

I don't understand why that case isn't covered.  Perhaps you should file
the bug instead of me?

> > @@ -879,14 +888,13 @@
> >
> >      def _get_source(self, to_format):
> >          """Return a source for streaming from this repository."""
> > -        if to_format.__class__ is self._format.__class__:
> > +        if (to_format.supports_chks and
> > +            self._format.repository_class is to_format.repository_class and
> > +            self._format._serializer == to_format._serializer):
>
> ^- this change is wrong; repo format classes have constant serializers -
> the check is overkill, and requiring a format to have a constant
> repository class, which if we'd needed we would have just passed that in
> instead.  Or to say it differently, either pass the repo class in, or
> don't test this.
>
> >              # We must be exactly the same format, otherwise stuff like the chk
> > -            # page layout might be different
> > +            # page layout might be different.
> > +            # Actually, this test is just slightly looser than exact so that
> > +            # CHK2 <-> 2a transfers will work.
> >              return GroupCHKStreamSource(self, to_format)
>
> CHK2 <-> 2a have the same serializer.  The test should just look at
> serializer I suspect.

Ok, I've deleted the repository_class part of the if.  Should I delete the
"to_format.supports_chks" part too?
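For reference, the slimmed-down check reads roughly like this (a sketch;
the fallback upcall at the end is from memory rather than the patch):

    def _get_source(self, to_format):
        """Return a source for streaming from this repository."""
        if (to_format.supports_chks and
                self._format._serializer == to_format._serializer):
            # The same serializer implies the same chk page layout; this
            # is just loose enough that CHK2 <-> 2a transfers also take
            # the optimised path.
            return GroupCHKStreamSource(self, to_format)
        # Otherwise fall back to the generic StreamSource.
        return super(CHKRepository, self)._get_source(to_format)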
> > === modified file 'bzrlib/repofmt/pack_repo.py'
> > --- bzrlib/repofmt/pack_repo.py 2009-07-01 10:42:14 +0000
> > +++ bzrlib/repofmt/pack_repo.py 2009-07-16 02:10:20 +0000
> > @@ -1567,7 +1574,7 @@
> >          # determine which packs need changing
> >          pack_operations = [[0, []]]
> >          for pack in self.all_packs():
> > -            if not hint or pack.name in hint:
> > +            if hint is None or pack.name in hint:
>
> ^- the original form is actually faster, AFAIK. because it skips the in
> test for a hint of [].  I'd rather we didn't change it, for all that its
> not in a common code path.

But an empty hint now means "nothing needs packing": with "not hint", an
empty list would be treated like None and trigger a full repack.  (And
the None test has to stay, or "pack.name in hint" would fail for
hint=None.)

> > @@ -2093,6 +2100,7 @@
> >              # when autopack takes no steps, the names list is still
> >              # unsaved.
> >              return self._save_pack_names()
> > +        return []
>
> ^- is this correctness or changing the signature?  If its the latter,
> update the docstring perhaps?

Both, in a sense.  A clarification, if you like: I found it useful to be
able to distinguish "hint was passed, but empty" from "no hint passed".
Otherwise passing an empty hint triggered a full pack!

Which docstring documents the return value of
commit_write_group/_commit_write_group, though?  I can add this to
Repository.commit_write_group's docstring:

    :return: it may return an opaque hint that can be passed to 'pack'.

But I think it's correct that the hint should be opaque at that level.  It
seems like a detail that belongs in the scope of pack_repo.py.  I guess
InterDifferingSerializer already assumes that the hint is a list...

Basically, where should I document this?
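For concreteness, the contract I'd be documenting is roughly this
(hypothetical docstring text, not in the patch):

    def pack(self, hint=None):
        """Compress the data within the repository.

        :param hint: None means repack everything; otherwise, a list of
            pack names (the hint returned by commit_write_group)
            restricting which packs are considered, with an empty list
            meaning there is nothing to do.
        """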
> > === modified file 'bzrlib/repository.py'
> > --- bzrlib/repository.py 2009-08-04 16:20:05 +0000
> > +++ bzrlib/repository.py 2009-08-05 02:37:11 +0000
> > @@ -924,6 +925,11 @@
> >          """
> >          if self._write_group is not self.get_transaction():
> >              # has an unlock or relock occured ?
> > +            if suppress_errors:
> > +                mutter(
> > +                '(suppressed) mismatched lock context and write group. %r, %r',
> > +                self._write_group, self.get_transaction())
> > +                return
>
> ^- is suppress_errors defined globally?

It's an argument to abort_write_group.  It seemed a bit ridiculous to have
an explicit "don't raise, just abort" flag and then the first thing the
function does is possibly raise an error ignoring that flag!  It was
masking a deeper failure for me at one point.

> > +    def _get_convertable_inventory_stream(self, revision_ids,
> > +                                          delta_versus_null=False):
> > +        # The source is using CHKs, but the target either doesn't or is has a
> > +        # different serializer.  The StreamSink code expects to be able to
> > +        # convert on the target, so we need to put bytes-on-the-wire that can
> > +        # be converted.  That means inventory deltas (if the remote is <1.18,
> > +        # RemoteStreamSink will fallback to VFS to insert the deltas).
>
> ^- broken english 'or is has'  In fact, reread the entire comment ;).

s/is/it/, :)

> > === modified file 'bzrlib/smart/repository.py'
> > --- bzrlib/smart/repository.py 2009-06-16 06:46:32 +0000
> > +++ bzrlib/smart/repository.py 2009-08-04 00:51:24 +0000
> > @@ -414,8 +418,39 @@
> >          repository.
> >          """
> >          self._to_format = network_format_registry.get(to_network_name)
> > +        if self._should_fake_unknown():
> > +            return FailedSmartServerResponse(
> > +                ('UnknownMethod', 'Repository.get_stream'))
> >          return None # Signal that we want a body.
> >
> > +    def _should_fake_unknown(self):
>
> ^- the long comment here could well be helped by a docstring.  I think
> the method name could be improved too - perhaps self._support_method(),
> or self._permit_method().  Returning UnknownMethod is a special way to
> support it after all.

Well, UnknownMethod is precisely for "this server says I do not have this
verb", so I'm not sure in what sense you can say returning it is a way to
support this verb.

I'm not sure how a docstring would make this clearer.  Specific
suggestions welcome.

> > +class SmartServerRepositoryInsertStream_1_18(SmartServerRepositoryInsertStreamLocked):
> > +    """Insert a record stream from a RemoteSink into a repository.
> > +
> > +    Same as SmartServerRepositoryInsertStreamLocked, except:
> > +     - the lock token argument is optional
> > +     - servers that implement this verb accept 'inventory-delta' records in the
> > +       stream.
> > +
> > +    New in 1.18.
> > +    """
> > +
> > +    def do_repository_request(self, repository, resume_tokens, lock_token=None):
> > +        """StreamSink.insert_stream for a remote repository."""
> > +        SmartServerRepositoryInsertStreamLocked.do_repository_request(
> > +            self, repository, resume_tokens, lock_token)
>
> ^ this is really odd -- you upcall, unconditionally, and do nothing
> else.  Is the reason to just make lock_token optional?  If so, why not
> just change the signature on
> SmartServerRepositoryInsertStreamLocked.do_repository_request?  Secondly,
> the base class will accept deltas too, AFAICT.  So really, its that
> calling the new registered verb is a way to be sure that deltas will be
> accepted.  I'd be happier with just:
> register SSRISL twice in the registry, and document the
> improvements/easements.  Note that we will/should be locking always (at
> the api layer, though not necessarily the network) for compatibility
> with knits - its why we have SSRISLocked anyway.  So the change to make
> lock_token optional seems unneeded.

I want to avoid incidental changes to old verbs to minimise the risk of
accidentally introducing cross-version incompatibilities.  It would be
easy for someone to glance at the definition of a verb in 1.18 and see
that it can take an optional arg, and then assume that therefore it
always took that arg, and write client code accordingly.

I needed a new verb registration, obviously, and I took the opportunity
to make the locking optional, which with hindsight is how the interface
should have been in the first place.  (We didn't add the _locked variant
until a release after the original, when we realised we needed that
capability and our original verb didn't have it.)

So, I'm comfortable that this is a) the right interface (just one verb),
and b) being appropriately safe about backwards compatibility (if erring
slightly on the side of paranoia).

To be really paranoid, I *am* actually tempted to explicitly guard
against inventory-deltas in the old verb, although I'm not sure that
actually buys us much for the effort involved.
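(If I did, the guard would be something like this hypothetical filter
wrapped around the old verb's stream - not in the patch, just to show the
shape:)

    def _reject_inventory_deltas(stream):
        """Refuse 'inventory-deltas' substreams on the pre-1.18 verb."""
        for substream_kind, substream in stream:
            if substream_kind == 'inventory-deltas':
                raise AssertionError(
                    "inventory-deltas substream sent to a verb that "
                    "predates them")
            yield substream_kind, substream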
> > === modified file 'bzrlib/tests/__init__.py'
> > --- bzrlib/tests/__init__.py 2009-08-04 11:40:59 +0000
> > +++ bzrlib/tests/__init__.py 2009-08-05 02:37:11 +0000
> > @@ -1938,6 +1938,16 @@
> >          sio.encoding = output_encoding
> >          return sio
> >
> > +    def disable_verb(self, verb):
> > +        """Disable a smart server verb for one test."""
> > +        from bzrlib.smart import request
> > +        request_handlers = request.request_handlers
> > +        orig_method = request_handlers.get(verb)
> > +        request_handlers.remove(verb)
> > +        def restoreVerb():
> > +            request_handlers.register(verb, orig_method)
> > +        self.addCleanup(restoreVerb)
>
> [aside] It would be nice to find a better management pattern for these
> helpers.

[agreed]

> The blackbox and unit test_push changes are identical; may be ripe for
> refactoring to reduce duplication there.

Yeah, I thought that too... not something I'm going to do right now, and
I'm not sure that it would make a good bug report.

> > === modified file 'bzrlib/tests/per_interrepository/__init__.py'
[...]
> > -    for interrepo_class, repository_format, repository_format_to in formats:
> > -        id = '%s,%s,%s' % (interrepo_class.__name__,
> > -                           repository_format.__class__.__name__,
> > -                           repository_format_to.__class__.__name__)
> > +    for repository_format, repository_format_to in formats:
> > +        id = '%s,%s' % (repository_format.__class__.__name__,
> > +                        repository_format_to.__class__.__name__)
>
> ^- This means that we can't specify a specific interrepo type in test
> selection.  I'm strongly against losing that facility.
[...]
> And I share John's concern about losing coverage with the above.  We were
> testing the same helper with different types before, and we're not able
> to do that now.  This doesn't seem coupled to your actual intent, perhaps
> it would be better to just backout this change.

See my reply to John.  bzr.dev doesn't have this facility; it just
misleadingly looks like it does.  Nothing was actually using
interrepo_class.  I removed the confusion.

I considered leaving it there simply for test selection, but that's not
the right way to do that, because it can be (and was!) out of sync with
what the test was actually doing.

> > +        if not unstacked_repo._format.supports_chks:
> > +            # these assertions aren't valid for groupcompress repos, which may
> > +            # transfer data than strictly necessary to avoid breaking up an
> > +            # already-compressed block of data.
> > +            self.assertFalse(unstacked_repo.has_revision('left'))
> > +            self.assertFalse(unstacked_repo.has_revision('right'))
>
> ^ please check the comment, its not quite clear.

s/transfer data/transfer more data/

I'm not 100% certain that's actually true.  What do you think of the
adjusted comment?

> > +    def assertCanStreamRevision(self, repo, revision_id):
> > +        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
> > +        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
> > +        source = repo._get_source(repo._format)
> > +        for substream_kind, substream in source.get_stream(search):
> > +            # Consume the substream
> > +            list(substream)
>
> ^ - this assertion, while named nice, isn't strong enough for me.  I'd
> like it to assert that either a delta matching the local changes, or a
> full inventory, are streamed - and that the newly referenced file texts
> are present.

Hmm, fair enough.  That may happen after landing, though, but not long
after.
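Something along these lines, perhaps (an untested sketch of the stronger
assertion; checking that the newly referenced file texts are present
would need more plumbing, so this only covers the inventory part):

    def assertCanStreamRevision(self, repo, revision_id):
        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
        source = repo._get_source(repo._format)
        keys_by_kind = {}
        for substream_kind, substream in source.get_stream(search):
            # Consume the substream, remembering which keys it carried.
            keys_by_kind.setdefault(substream_kind, set()).update(
                record.key for record in substream)
        # The revision record itself must be present...
        self.assertTrue(
            (revision_id,) in keys_by_kind.get('revisions', set()))
        # ...along with either a full inventory or an inventory delta.
        self.assertTrue(
            keys_by_kind.get('inventories') or
            keys_by_kind.get('inventory-deltas'))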
> > === modified file 'bzrlib/tests/test_inventory_delta.py'
> > --- bzrlib/tests/test_inventory_delta.py 2009-04-02 05:53:12 +0000
> > +++ bzrlib/tests/test_inventory_delta.py 2009-08-05 02:30:11 +0000
> ...
>
> The changes in this test script match the inventory_delta change -
> but as I'm not convinced by that change they feel somewhat awkward - and
> incomplete, as there are more dimensions for the object to vary in now.

It is a bit awkward; I think splitting the serializer and deserializer
would help that.

> > === modified file 'bzrlib/tests/test_remote.py'
> > --- bzrlib/tests/test_remote.py 2009-07-30 04:27:05 +0000
> > +++ bzrlib/tests/test_remote.py 2009-08-05 01:41:03 +0000
> >
> > +    """Tests for using Repository.insert_stream verb when the _1.18 variant is
> > +    not available.
> > +
> > +    This test case is very similar to TestRepositoryInsertStream_1_18.
> > +    """
>
> Perhaps a base class would make this smaller - it is nearly identical.

I don't see any elegant ways to cut the duplication at this point.  I
haven't tried very hard though, because it's just small enough that it
fits under my threshold for tolerating the ugly...

> > +    def test_stream_with_inventory_deltas(self):
> > +        """'inventory-deltas' substreams can't be sent to the
> > +        Repository.insert_stream verb.  So when one is encountered the
> > +        RemoteSink immediately stops using that verb and falls back to VFS
> > +        insert_stream.
> > +        """
>
> ^ this comment isn't strictly true.  We *can* send them, to servers
> running 1.18.  But not all servers will *accept* them, and thats why we
> want to be sure we *won't* send them.  I appreciate this is nitpicking,
> but we're encoding knowledge about the system for later developers, and
> this is worth being clear on, I think.

Fair enough.  Clarified:

    """'inventory-deltas' substreams cannot be sent to the
    Repository.insert_stream verb, because not all servers that implement
    that verb will accept them.  So when one is encountered the RemoteSink
    immediately stops using that verb and falls back to VFS insert_stream.
    """

> I'm not sure how to say 'needs fixing' in review mail, but thats what
> I'd say.
>
> Its close though.

Thanks very much for the extensive review.

-Andrew.