John A Meinel wrote:
> Review: Needs Fixing
> So, to further my discussion. While investigating this, I found some odd bits
> in _stream_invs_as_deltas. For example:
[...]
>
> ^- Notice that once you've prepared a delta for "inv.revision_id" you then set
> "basis_id" as part of "invs_sent_so_far". Which AFAICT means you will sending
> most of the inventories as complete texts, because you won't actually think
> you've sent what you actually have done.
Oops, right. I've merged your fix for that.
> + yield versionedfile.InventoryDeltaContentFactory(
> + key, parents, None, delta, basis_id, flags, from_repo)
>
> The way you've handled the "no parents are available use NULL", means that you
> actually create a delta to NULL for *every* revision, and find out if it is the
> smallest one. Which seems inefficient versus just using NULL when nothing else
> is available. (Note that IDS goes as far as to not use parents that aren't
> cached even if they have been sent, and to fall back to using the last-sent
> revision otherwise.)
>
> I've changed this loop around a bit, to avoid some duplication and make it a
> little bit clearer (at least to me) what is going on. I also added a quick
> LRUCache for the parent inventories that we've just sent, as re-extracting
> them from the repository is not going to be efficient. (Especially extracting
> them one at a time.)
Thanks, I've merged your fix.
> Going for that last line... 'key' is a tuple, and 'parents' is a tuple of tuples (or a list of tuples), but 'basis_id' is a simple string.
>
> It seems like we should be consistent at that level. What do you think?
It would be nice, but not vital.
> As for "flags", wouldn't it be better to pass that in as a *dict*. You pass it directly to:
[...]
> serializer.require_flags(**self._format_flags)
>
> It isn't vastly better, but at least they are named arguments, rather than
> fairly arbitrary positional ones.
Ok, I'll do that.
> In the end, I don't think that it is ideal that an incremental push of 1
> revision will transmit a complete inventory (in delta form). I understand the
> limitation is that we currently are unable to 'buffer' these records anywhere
> in a persistent state between RPC calls. (Even though we have a bytes-on-the
> wire representation, we don't have an index/etc to look them up in.)
(As said elsewhere on this thread...)
This is only an issue for a cross-format push, which doesn't have to be
maximally efficient. It just has to be reasonable. We can look at doing better
later if we want, but this is already a massive improvement in terms of data
transferred compared to IDS.
[...]
> For *me* the ssh handshake is probably signficantly more than the time to push
> up 2 inventories of bzr.dev, but that tradeoff is very different for people
> working over a tiny bandwidth from their GPRS phone, trying to push a critical
> bugfix for a Launchpad branch.
Then they probably should go to the effort of having their local branch in a
matching format :)
We're not intending to use inventory-deltas for anything but cross-format
fetches AFAIK.
> Again, the code as written fails to actually transmit data over the smart protocol, I get:
> $ time awbzr branch bzrtools_1.9/ -r -100 bzr://localhost/test-2a/b
> bzr: ERROR: Version present for / in TREE_ROOT
>
> even with the fixes I've mentioned so far.
I've found and fixed this. It turned out to be a bug in inventory_delta.py (it
couldn't round-trip deltas from non-rich-root inventories).
[...]
> ^- However, if I traced it correctly, old_name == 'basis_id' which is a string, but new_name == key, aka a tuple.
>
> So we're clearly doing something wrong here.
Yes. I found and fixed this too. I've also made delta_to_lines raise TypeError
if it gets non-strs for old_name or new_name, because that's always going to be
a mistake. The current behaviour (which will silently str a tuple) isn't very
helpful.
> Digging further, the bytes on the wire don't include things like:
> InventoryDeltaContentFactory.parents (list/tuple of tuple keys)
> InventoryDeltaContentFactory.sha1
Inventories don't have parent lists, IIRC? Revisions do, and text versions do,
but inventories don't. (They may have a compression parent, but not a semantic
parent.)
The sha1 is never set, although perhaps it should be.
These two attributes are really just there to satisfy the ContentFactory
interface. After a separate discussion with Robert I'm goint get rid of
InventoryDeltaContentFactory and instead make an explicit inventory-delta
substream that will simply carry FulltextContentFactories of serialised deltas.
(I'm not thrilled at an unnecessary serialise/deseralise pass this will incur in
process, but it doesn't seem to be the bottleneck at the moment.)
Thanks, merged. (Including, for now, the disabling of
InterDifferingSerializer... FWIW, I was doing that by commenting out the
registration call rather than hacking is_compatible :P)
I'll send an updated merge proposal once I've done the change to use a new
substream rather than add a content factory that isn't really a factory for
bytes in a versionedfile.
John A Meinel wrote: invs_as_ deltas. For example:
> Review: Needs Fixing
> So, to further my discussion. While investigating this, I found some odd bits
> in _stream_
[...]
>
> ^- Notice that once you've prepared a delta for "inv.revision_id" you then set
> "basis_id" as part of "invs_sent_so_far". Which AFAICT means you will sending
> most of the inventories as complete texts, because you won't actually think
> you've sent what you actually have done.
Oops, right. I've merged your fix for that.
> + yield versionedfile. InventoryDeltaC ontentFactory(
> + key, parents, None, delta, basis_id, flags, from_repo)
>
> The way you've handled the "no parents are available use NULL", means that you
> actually create a delta to NULL for *every* revision, and find out if it is the
> smallest one. Which seems inefficient versus just using NULL when nothing else
> is available. (Note that IDS goes as far as to not use parents that aren't
> cached even if they have been sent, and to fall back to using the last-sent
> revision otherwise.)
>
> I've changed this loop around a bit, to avoid some duplication and make it a
> little bit clearer (at least to me) what is going on. I also added a quick
> LRUCache for the parent inventories that we've just sent, as re-extracting
> them from the repository is not going to be efficient. (Especially extracting
> them one at a time.)
Thanks, I've merged your fix.
> Going for that last line... 'key' is a tuple, and 'parents' is a tuple of tuples (or a list of tuples), but 'basis_id' is a simple string.
>
> It seems like we should be consistent at that level. What do you think?
It would be nice, but not vital.
> As for "flags", wouldn't it be better to pass that in as a *dict*. You pass it directly to: require_ flags(* *self._ format_ flags)
[...]
> serializer.
>
> It isn't vastly better, but at least they are named arguments, rather than
> fairly arbitrary positional ones.
Ok, I'll do that.
> In the end, I don't think that it is ideal that an incremental push of 1
> revision will transmit a complete inventory (in delta form). I understand the
> limitation is that we currently are unable to 'buffer' these records anywhere
> in a persistent state between RPC calls. (Even though we have a bytes-on-the
> wire representation, we don't have an index/etc to look them up in.)
(As said elsewhere on this thread...)
This is only an issue for a cross-format push, which doesn't have to be
maximally efficient. It just has to be reasonable. We can look at doing better
later if we want, but this is already a massive improvement in terms of data
transferred compared to IDS.
[...]
> For *me* the ssh handshake is probably signficantly more than the time to push
> up 2 inventories of bzr.dev, but that tradeoff is very different for people
> working over a tiny bandwidth from their GPRS phone, trying to push a critical
> bugfix for a Launchpad branch.
Then they probably should go to the effort of having their local branch in a
matching format :)
We're not intending to use inventory-deltas for anything but cross-format
fetches AFAIK.
> Again, the code as written fails to actually transmit data over the smart protocol, I get: /test-2a/ b
> $ time awbzr branch bzrtools_1.9/ -r -100 bzr://localhost
> bzr: ERROR: Version present for / in TREE_ROOT
>
> even with the fixes I've mentioned so far.
I've found and fixed this. It turned out to be a bug in inventory_delta.py (it
couldn't round-trip deltas from non-rich-root inventories).
[...]
> ^- However, if I traced it correctly, old_name == 'basis_id' which is a string, but new_name == key, aka a tuple.
>
> So we're clearly doing something wrong here.
Yes. I found and fixed this too. I've also made delta_to_lines raise TypeError
if it gets non-strs for old_name or new_name, because that's always going to be
a mistake. The current behaviour (which will silently str a tuple) isn't very
helpful.
> Digging further, the bytes on the wire don't include things like: ontentFactory. parents (list/tuple of tuple keys) ontentFactory. sha1
> InventoryDeltaC
> InventoryDeltaC
Inventories don't have parent lists, IIRC? Revisions do, and text versions do,
but inventories don't. (They may have a compression parent, but not a semantic
parent.)
The sha1 is never set, although perhaps it should be.
These two attributes are really just there to satisfy the ContentFactory ontentFactory and instead make an explicit inventory-delta Factories of serialised deltas. deseralise pass this will incur in
interface. After a separate discussion with Robert I'm goint get rid of
InventoryDeltaC
substream that will simply carry FulltextContent
(I'm not thrilled at an unnecessary serialise/
process, but it doesn't seem to be the bottleneck at the moment.)
[...]
> So I've made some updates into lp:~jameinel/bzr/1.18-inventory-delta
Thanks, merged. (Including, for now, the disabling of erializer. .. FWIW, I was doing that by commenting out the
InterDifferingS
registration call rather than hacking is_compatible :P)
I'll send an updated merge proposal once I've done the change to use a new
substream rather than add a content factory that isn't really a factory for
bytes in a versionedfile.
-Andrew.