Bazaar

Code review comment for lp:~jelmer/bzr/hpss-get-inventories

hpss-get-inventories
Merge into bzr.dev

Revision history for this message

Jelmer Vernooij (jelmer) wrote on 2011-12-05:

Am 05/12/11 11:31, schrieb Andrew Bennetts:
> Jelmer Vernooij wrote:
> …
>>> I'm glad this is using inventory-deltas. But if it's worth zlib compressing
>>> them here, shouldn't we do that for inventory-deltas from get_record_stream
>>> too?
>> It's a pretty big inventory delta in this case - for something like gcc it
>> matters for the delta between null: and the initial revision that was
>> requested. I haven't investigated whether it would help for record streams
>> too, but I imagine it would have less of an effect.
> Well, I know that there are already cases that send deltas from null: —
> something to do with stacking perhaps? So reusing this improvement globally
> would be nice.
I guess the best way to do this would be to add a zlib pack record kind
for network streams? I.e. a new entry in
NetworkRecordStream._kind_factory, and something equivalent on the
server side?
>>> Also, could you implement this via the get_record_stream RPC with a new
>>> parameter that means “just inventories (rather than rev+inv+texts+chks+sigs
>>> for
>>> given keys)”?
>>>
>>> On one hand it might feel a bit ugly to make one RPC do so many things,
>>> get_record_stream, do so many things, but on the other I think it provides a
>>> useful pressure to keep the interface minimal and consistent (e.g. whether
>>> records are zlib-compressed).
>>>
>>> Adding this iter_inventories RPC might be the right thing (although putting
>>> “iter” in the name of an RPC feels weird to me, that's surely a property of
>>> the Python API rather than a property of the RPC), but I'd like hear what you
>>> think about the tradeoffs of doing that vs. extending get_record_stream.
>> With regard to the name: Most of the other calls seem to be named after the
>> equivalent method in the client / server, that's why I went with
>> Repository.iter_inventories. I'm not particularly tied to that name, something
>> like "Repository.get_inventories" or "Repository.stream_inventories" seems
>> reasonable to me too.
> I think of using the name of the Python API as a good guide, not a strict rule.
> Certainly there's no insert_stream_locked method on Repository…
>
> “Repository.get_inventories” sounds fine to me.
I've renamed it.
>
>> I'm not sure whether this should be part of get_record_stream. I have a hard
>> time understanding the get_record_stream code as it is, so I'd rather not make
>> it even more complex by adding more parameters - for example,
>> get_record_streams seem to be dependent on the repository on-disk format to an
>> extent - the inventory stream is not, as it's using inventory deltas. In other
>> words, adding another verb was simpler, while keeping it all understandable
>> for a mere mortal like me. :-)
>>
>> Either way, I agree we should be reusing more code between the work I've done
>> recently and the existing record stream calls. But it seems to me that would
>> best be done by refactoring so they e.g. use the same code for sending a
>> stream of blobs of indeterminate length, rather than by all being a part of
>> the same verb.
> Well, the path to reuse we've developed *is* record streams. I'm quite willing
> to believe that get_record_stream's parameters aren't a convenient way to
> express all needs. But I'd really like it the thing that was returned by a new
> verb was a true record stream — something you could pass directly to
> insert_record_stream. The idea is that record streams should be the One True
> Way we have to say “here is a stream of data from a repository.” The existing
> get_record_stream stuff is pretty complex, but IIRC that's from two sources:
>
> 1. the complexity of constructing a definitely
> ready-to-commit-perhaps-after-a-single-fixup-fetch-for-stacking-invariants
> stream for the “give me all revisions/invs/texts/etc records for revisions
> in this subgraph of the revision graph”. That's mostly independent of
> get_record_stream, it just happens to be where the logic for that is
> encoded.
>
> 2. the abstractions/indirections to allow various different repository formats
> to send their own native records directly via record streams.
>
> I think 2 is a strong reason to keep using records streams, rather than
> inventing something new and discovering that out of necessity you have ended up
> reinventing it.
>
> The basic structure of a record stream itself is very simple and lightweight
> (although it may well benefit from being wrapped in zlib compression over the
> wire?), and should be reused as the default decision for repository data unless
> there's a really good reason not to.
>
> The current method of extending get_record_stream by shoehorning more and more
> query types into the one kind of call perhaps isn't a good tradeoff. I think at
> some level it will boil down to the same work though, even if the veneer is
> get_inventories(invs) rather than get_record_stream(JustInventories(invs), …).
I've done some more digging in the get_record_stream verb code today.
Using record streams seems reasonable to me, but I would like to use a
separate verb (rather than Repository.get_stream_1.19 and friends) to
keep the interface simple - there's so much going on in
get_record_stream that we don't need here.

It seems like there are two things we can do - either just send a stream
similar to get_record_stream_1.19 with a single inventory-delta
"substream", or we could just send the substream with inventory deltas.
The latter is simpler and would need slightly less bandwidth, but can't
be used with e.g. insert_record_stream. I'm not sure if we care about
that, though. What do you think?

The body_stream method of the GetInventories server side implementation
would look something like this if we were directly using substreams:

     def body_stream(self, repository, ordering, revids):
         pack_writer = pack.ContainerSerialiser()
         yield pack_writer.begin()
         serializer = inventory_delta.InventoryDeltaSerializer(
             repository.supports_rich_root(),
             repository._format.supports_tree_reference)

         prev_inv = _mod_inventory.Inventory(root_id=None,
             revision_id=_mod_revision.NULL_REVISION)
         self._repository.lock_read()
         try:
             for inv, revid in repository._iter_inventories(revids,
ordering):
                 if inv is None:
                     continue
                 inv_delta = inv._make_delta(prev_inv)
                 lines = serializer.delta_to_lines(
                     prev_inv.revision_id, inv.revision_id, inv_delta)
                 yield pack_writer.bytes_record("".join(lines),
[('inventory-delta',)])
                 prev_inv = inv
         finally:
             self._repository.unlock()
         yield pack_writer.end()

Thanks again for reviewing this - I really appreciate your thoughts.

Cheers,

Jelmer

Am 05/12/11 11:31, schrieb Andrew Bennetts:
> Jelmer Vernooij wrote:
> …
>>> I'm glad this is using inventory-deltas.  But if it's worth zlib compressing
>>> them here, shouldn't we do that for inventory-deltas from get_record_stream
>>> too?
>> It's a pretty big inventory delta in this case - for something like gcc it
>> matters for the delta between null: and the initial revision that was
>> requested. I haven't investigated whether it would help for record streams
>> too, but I imagine it would have less of an effect.
> Well, I know that there are already cases that send deltas from null: —
> something to do with stacking perhaps?  So reusing this improvement globally
> would be nice.
I guess the best way to do this would be to add a zlib pack record kind 
for network streams? I.e. a new entry in 
NetworkRecordStream._kind_factory, and something equivalent on the 
server side?
>>> Also, could you implement this via the get_record_stream RPC with a new
>>> parameter that means “just inventories (rather than rev+inv+texts+chks+sigs
>>> for
>>> given keys)”?
>>>
>>> On one hand it might feel a bit ugly to make one RPC do so many things,
>>> get_record_stream, do so many things, but on the other I think it provides a
>>> useful pressure to keep the interface minimal and consistent (e.g. whether
>>> records are zlib-compressed).
>>>
>>> Adding this iter_inventories RPC might be the right thing (although putting
>>> “iter” in the name of an RPC feels weird to me, that's surely a property of
>>> the Python API rather than a property of the RPC), but I'd like hear what you
>>> think about the tradeoffs of doing that vs. extending get_record_stream.
>> With regard to the name: Most of the other calls seem to be named after the
>> equivalent method in the client / server, that's why I went with
>> Repository.iter_inventories. I'm not particularly tied to that name, something
>> like "Repository.get_inventories" or "Repository.stream_inventories" seems
>> reasonable to me too.
> I think of using the name of the Python API as a good guide, not a strict rule.
> Certainly there's no insert_stream_locked method on Repository…
>
> “Repository.get_inventories” sounds fine to me.
I've renamed it.
>
>> I'm not sure whether this should be part of get_record_stream. I have a hard
>> time understanding the get_record_stream code as it is, so I'd rather not make
>> it even more complex by adding more parameters - for example,
>> get_record_streams seem to be dependent on the repository on-disk format to an
>> extent - the inventory stream is not, as it's using inventory deltas. In other
>> words, adding another verb was simpler, while keeping it all understandable
>> for a mere mortal like me. :-)
>>
>> Either way, I agree we should be reusing more code between the work I've done
>> recently and the existing record stream calls. But it seems to me that would
>> best be done by refactoring so they e.g. use the same code for sending a
>> stream of blobs of indeterminate length, rather than by all being a part of
>> the same verb.
> Well, the path to reuse we've developed *is* record streams.  I'm quite willing
> to believe that get_record_stream's parameters aren't a convenient way to
> express all needs.  But I'd really like it the thing that was returned by a new
> verb was a true record stream — something you could pass directly to
> insert_record_stream.  The idea is that record streams should be the One True
> Way we have to say “here is a stream of data from a repository.”  The existing
> get_record_stream stuff is pretty complex, but IIRC that's from two sources:
>
>   1. the complexity of constructing a definitely
>      ready-to-commit-perhaps-after-a-single-fixup-fetch-for-stacking-invariants
>      stream for the “give me all revisions/invs/texts/etc records for revisions
>      in this subgraph of the revision graph”.  That's mostly independent of
>      get_record_stream, it just happens to be where the logic for that is
>      encoded.
>
>   2. the abstractions/indirections to allow various different repository formats
>      to send their own native records directly via record streams.
>
> I think 2 is a strong reason to keep using records streams, rather than
> inventing something new and discovering that out of necessity you have ended up
> reinventing it.
>
> The basic structure of a record stream itself is very simple and lightweight
> (although it may well benefit from being wrapped in zlib compression over the
> wire?), and should be reused as the default decision for repository data unless
> there's a really good reason not to.
>
> The current method of extending get_record_stream by shoehorning more and more
> query types into the one kind of call perhaps isn't a good tradeoff.  I think at
> some level it will boil down to the same work though, even if the veneer is
> get_inventories(invs) rather than get_record_stream(JustInventories(invs), …).
I've done some more digging in the get_record_stream verb code today. 
Using record streams seems reasonable to me, but I would like to use a 
separate verb (rather than Repository.get_stream_1.19 and friends) to 
keep the interface simple - there's so much going on in 
get_record_stream that we don't need here.

It seems like there are two things we can do - either just send a stream 
similar to get_record_stream_1.19 with a single inventory-delta 
"substream", or we could just send the substream with inventory deltas. 
The latter is simpler and would need slightly less bandwidth, but can't 
be used with e.g. insert_record_stream. I'm not sure if we care about 
that, though. What do you think?

The body_stream method of the GetInventories server side implementation 
would look something like this if we were directly using substreams:

prev_inv = _mod_inventory.Inventory(root_id=None,
             revision_id=_mod_revision.NULL_REVISION)
         self._repository.lock_read()
         try:
             for inv, revid in repository._iter_inventories(revids, 
ordering):
                 if inv is None:
                     continue
                 inv_delta = inv._make_delta(prev_inv)
                 lines = serializer.delta_to_lines(
                     prev_inv.revision_id, inv.revision_id, inv_delta)
                 yield pack_writer.bytes_record("".join(lines), 
[('inventory-delta',)])
                 prev_inv = inv
         finally:
             self._repository.unlock()
         yield pack_writer.end()

Thanks again for reviewing this - I really appreciate your thoughts.

Cheers,

Jelmer

« Back to merge proposal