Code review comment for lp:~spiv/bzr/inventory-delta

John A Meinel (jameinel) wrote :

So, to continue my earlier discussion: while investigating this, I found some odd bits in _stream_invs_as_deltas. For example:

+    def _stream_invs_as_deltas(self, revision_ids, fulltexts=False):
         from_repo = self.from_repository

...

+        inventories = self.from_repository.iter_inventories(
+            revision_ids, 'topological')
+        # XXX: ideally these flags would be per-revision, not per-repo (e.g.
+        # streaming a non-rich-root revision out of a rich-root repo back into
+        # a non-rich-root repo ought to be allowed)
+        format = from_repo._format
+        flags = (format.rich_root_data, format.supports_tree_reference)
+        invs_sent_so_far = set([_mod_revision.NULL_REVISION])
+        for inv in inventories:
             key = (inv.revision_id,)

...

+            for parent_id in parent_ids:
...
+                if (best_delta is None or
+                        len(best_delta) > len(candidate_delta)):
+                    best_delta = candidate_delta
+                    basis_id = parent_id
+            delta = best_delta
+            invs_sent_so_far.add(basis_id)

^- Notice that once you've prepared a delta for "inv.revision_id" you then add "basis_id" to "invs_sent_so_far". Which AFAICT means you will end up sending most of the inventories as complete texts, because the set never records what you have actually sent.

+            yield versionedfile.InventoryDeltaContentFactory(
+                key, parents, None, delta, basis_id, flags, from_repo)
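A minimal sketch of the fix, with the graph walk reduced to its essentials (the names and shapes here are assumed from the quoted hunk, not taken from the branch):

```python
def stream_order(revision_ids, parent_map, null_rev='null:'):
    """Pick a basis for each revision, recording what was actually sent."""
    invs_sent_so_far = set([null_rev])
    plan = []
    for rev_id in revision_ids:
        # only parents we have really streamed are usable as a basis
        candidates = [p for p in parent_map[rev_id] if p in invs_sent_so_far]
        basis_id = candidates[0] if candidates else null_rev
        plan.append((rev_id, basis_id))
        # the fix: record the revision just sent, not its basis; with
        # add(basis_id), later revisions never see their true parents
        # as "already sent" and degrade to NULL-based deltas
        invs_sent_so_far.add(rev_id)
    return plan
```

With the original `invs_sent_so_far.add(basis_id)`, revision 'c' in a simple a→b→c chain would fall back to the NULL basis even though 'b' had just been streamed.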

The way you've handled the "no parents are available, use NULL" case means that
you actually create a delta against NULL for *every* revision and then check
whether it is the smallest one. That seems inefficient versus just using NULL
when nothing else is available. (Note that IDS goes as far as to not use
parents that aren't cached even if they have been sent, and to fall back to
using the last-sent revision otherwise.)

I've changed this loop around a bit, to avoid some duplication and make it a little bit clearer (at least to me) what is going on. I also added a quick LRUCache for the parent inventories that we've just sent, as re-extracting them from the repository is not going to be efficient. (Especially extracting them one at a time.)
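For reference, the cache behaviour I have in mind is just the usual bounded LRU map (the branch itself would use bzrlib's LRUCache; this OrderedDict version is only an illustration):

```python
from collections import OrderedDict

class InventoryCache(object):
    """Bounded LRU map for recently-sent parent inventories (illustrative)."""

    def __init__(self, max_size=25):
        self._max_size = max_size
        self._cache = OrderedDict()

    def get(self, revision_id):
        if revision_id not in self._cache:
            return None
        self._cache.move_to_end(revision_id)  # mark as recently used
        return self._cache[revision_id]

    def add(self, revision_id, inv):
        self._cache[revision_id] = inv
        self._cache.move_to_end(revision_id)
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)   # evict least recently used
```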

Looking at that last line: 'key' is a tuple, and 'parents' is a tuple of tuples (or a list of tuples), but 'basis_id' is a plain string.

It seems like we should be consistent at that level. What do you think?

As for "flags", wouldn't it be better to pass that in as a *dict*? You pass it directly to:

            serializer.require_flags(*self._format_flags)

And that, IMO, is asking for trouble. Yes it is most likely correct as written, but it means that you have to line up the tuple created at the start of _stream_invs_as_deltas:
        flags = (format.rich_root_data, format.supports_tree_reference)

with the *arguments* to a function nested 3 calls away.
So how about:

  flags = {'rich_root_data': format.rich_root_data,
           'supports_tree_reference': format.supports_tree_reference
          }
...

and
  serializer.require_flags(**self._format_flags)

It isn't vastly better, but at least they are named arguments, rather than
fairly arbitrary positional ones.
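A tiny runnable illustration of the difference (the require_flags signature here is assumed for the example, not taken from the branch):

```python
class FakeSerializer(object):
    """Stand-in for the inventory-delta serializer (hypothetical)."""

    def require_flags(self, rich_root_data, supports_tree_reference):
        self._versioned_root = rich_root_data
        self._tree_references = supports_tree_reference

flags = {'rich_root_data': True,
         'supports_tree_reference': False}
serializer = FakeSerializer()
# keyword expansion: each value lands on the argument it names,
# regardless of the order in which the dict was built
serializer.require_flags(**flags)
```

With the positional tuple form, swapping the two entries when building `flags` would silently swap the serializer's behaviour; with `**flags` it is either correct or an immediate TypeError.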

In the end, I don't think that it is ideal that an incremental push of 1
revision will transmit a complete inventory (in delta form). I understand the
limitation is that we currently are unable to 'buffer' these records anywhere
in a persistent state between RPC calls. (Even though we have a bytes-on-the-wire
representation, we don't have an index/etc to look them up in.)

I suppose one possibility would be to have the client reject the stream
entirely and request it be re-sent with the basis available. The states are:

1) Pushing to a shared repo, everything should be there that isn't being
pushed, so you should send minimal deltas.

2) Pushing to a newly formed stacked repo. Likely things will be available in
the fallback repo, but we've chosen to *not* give remote smart repositories
access to the fallback.

3) Incremental push to a stacked repository. Will likely already have all the
parent inventories, so we don't need to push them yet again.

Unfortunately we are pessimising 1 & 3 just for the sake of 2. I'd like to see
logic of the form:
  a) These are the revisions I'm going to transmit, assuming other clients will
  have access to this fallback repo.
  b) But the server won't have access to this subset, so this is the
  *accessible* subset which I'll use to determine the bytes-on-the-wire.
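In set terms, the split I am after is roughly this (all names here are hypothetical; "accessible" is what the sender may delta against when producing bytes-on-the-wire):

```python
def plan_stream(new_revs, target_has, fallback_has, server_sees_fallback):
    # a) revisions to transmit: clients reading the branch later can use
    #    the fallback, so anything it holds need not be re-sent
    to_send = new_revs - target_has - fallback_has
    # b) bases usable for deltas: the smart server may be unable to read
    #    the fallback, so restrict the basis set accordingly
    accessible = set(target_has)
    if server_sees_fallback:
        accessible |= fallback_has
    return to_send, accessible
```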

Note that this also starts to handle having clients push up parent inventory
texts opportunistically, rather than waiting for the remote side to say "I
don't actually have this parent's inventory, because I'm stacked and you hid my
fallback from me."

Of course, the get_stream_for_missing_keys code is wired up such that clients
ignore the response about missing inventories if the target is a stacked repo,
and just always force up an extra fulltext copy of all parent inventories
anyway. (Because of the bugs in 1.13, IIRC.)

It doesn't have to happen today, but the code as written means that changing a
single file, committing, and pushing that up to a stacked branch is going to
upload at least 2 "complete/full" inventory deltas. (The rev you are pushing,
and its parent.)

For *me* the ssh handshake is probably significantly more than the time to push
up 2 inventories of bzr.dev, but that tradeoff is very different for people
working over a tiny bandwidth from their GPRS phone, trying to push a critical
bugfix for a Launchpad branch.

Again, the code as written fails to actually transmit data over the smart protocol; I get:
$ time awbzr branch bzrtools_1.9/ -r -100 bzr://localhost/test-2a/b
bzr: ERROR: Version present for / in TREE_ROOT

even with the fixes I've mentioned so far.

Also, I tried to see what the serialized form over the wire was, and I think the code was:

        lines.sort()
        lines[0] = "format: %s\n" % InventoryDeltaSerializer.FORMAT_1
        lines[1] = "parent: %s\n" % old_name
        lines[2] = "version: %s\n" % new_name
        lines[3] = "versioned_root: %s\n" % self._serialize_bool(
            self._versioned_root)
        lines[4] = "tree_references: %s\n" % self._serialize_bool(
            self._tree_references)

^- However, if I traced it correctly, old_name == basis_id, which is a string, while new_name == key, which is a tuple.

So we're clearly doing something wrong here.
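The mismatch is easy to pin down; whichever convention wins, the serializer should normalise both names at one well-defined point, e.g. (sketch, with names assumed from the quoted code):

```python
def _as_name(name):
    """Accept either a revision-id string or a 1-tuple key; emit a string."""
    if isinstance(name, tuple):
        (name,) = name
    return name

def header_lines(old_name, new_name):
    # both names reach the wire as plain revision-id strings
    return ["parent: %s\n" % _as_name(old_name),
            "version: %s\n" % _as_name(new_name)]
```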

Digging further, the bytes on the wire don't include things like:
 InventoryDeltaContentFactory.parents (list/tuple of tuple keys)
 InventoryDeltaContentFactory.sha1

Which means that I don't see any way to get perfect mapping out of the deserialized form. (I suppose you can then go to the revision texts to find the parent mapping, but that really isn't nice to do.)

Anyway, it means that what you get out the other side of translating into and then back out of the serialized form is not exactly equal to what went in, which seems very bad (and likely to *not* break when using local actions, and then break when using remote ones).

So I've made some updates in lp:~jameinel/bzr/1.18-inventory-delta, but ultimately this isn't ready to land.

review: Needs Fixing
