Bazaar

Code review comment for lp:~jelmer/bzr/hpss-get-inventories

hpss-get-inventories
Merge into bzr.dev

Revision history for this message

Andrew Bennetts (spiv) wrote on 2011-12-05:

Jelmer Vernooij wrote:
…
> > I'm glad this is using inventory-deltas. But if it's worth zlib compressing
> > them here, shouldn't we do that for inventory-deltas from get_record_stream
> > too?
> It's a pretty big inventory delta in this case - for something like gcc it
> matters for the delta between null: and the initial revision that was
> requested. I haven't investigated whether it would help for record streams
> too, but I imagine it would have less of an effect.

Well, I know that there are already cases that send deltas from null: —
something to do with stacking perhaps? So reusing this improvement globally
would be nice.

> > Also, could you implement this via the get_record_stream RPC with a new
> > parameter that means “just inventories (rather than rev+inv+texts+chks+sigs
> > for
> > given keys)”?
> >
> > On one hand it might feel a bit ugly to make one RPC do so many things,
> > get_record_stream, do so many things, but on the other I think it provides a
> > useful pressure to keep the interface minimal and consistent (e.g. whether
> > records are zlib-compressed).
> >
> > Adding this iter_inventories RPC might be the right thing (although putting
> > “iter” in the name of an RPC feels weird to me, that's surely a property of
> > the Python API rather than a property of the RPC), but I'd like hear what you
> > think about the tradeoffs of doing that vs. extending get_record_stream.
>
> With regard to the name: Most of the other calls seem to be named after the
> equivalent method in the client / server, that's why I went with
> Repository.iter_inventories. I'm not particularly tied to that name, something
> like "Repository.get_inventories" or "Repository.stream_inventories" seems
> reasonable to me too.

I think of using the name of the Python API as a good guide, not a strict rule.
Certainly there's no insert_stream_locked method on Repository…

“Repository.get_inventories” sounds fine to me.

> I'm not sure whether this should be part of get_record_stream. I have a hard
> time understanding the get_record_stream code as it is, so I'd rather not make
> it even more complex by adding more parameters - for example,
> get_record_streams seem to be dependent on the repository on-disk format to an
> extent - the inventory stream is not, as it's using inventory deltas. In other
> words, adding another verb was simpler, while keeping it all understandable
> for a mere mortal like me. :-)
>
> Either way, I agree we should be reusing more code between the work I've done
> recently and the existing record stream calls. But it seems to me that would
> best be done by refactoring so they e.g. use the same code for sending a
> stream of blobs of indeterminate length, rather than by all being a part of
> the same verb.

Well, the path to reuse we've developed *is* record streams. I'm quite willing
to believe that get_record_stream's parameters aren't a convenient way to
express all needs. But I'd really like it the thing that was returned by a new
verb was a true record stream — something you could pass directly to
insert_record_stream. The idea is that record streams should be the One True
Way we have to say “here is a stream of data from a repository.” The existing
get_record_stream stuff is pretty complex, but IIRC that's from two sources:

1. the complexity of constructing a definitely
    ready-to-commit-perhaps-after-a-single-fixup-fetch-for-stacking-invariants
    stream for the “give me all revisions/invs/texts/etc records for revisions
    in this subgraph of the revision graph”. That's mostly independent of
    get_record_stream, it just happens to be where the logic for that is
    encoded.

2. the abstractions/indirections to allow various different repository formats
to send their own native records directly via record streams.

I think 2 is a strong reason to keep using records streams, rather than
inventing something new and discovering that out of necessity you have ended up
reinventing it.

The basic structure of a record stream itself is very simple and lightweight
(although it may well benefit from being wrapped in zlib compression over the
wire?), and should be reused as the default decision for repository data unless
there's a really good reason not to.

The current method of extending get_record_stream by shoehorning more and more
query types into the one kind of call perhaps isn't a good tradeoff. I think at
some level it will boil down to the same work though, even if the veneer is
get_inventories(invs) rather than get_record_stream(JustInventories(invs), …).

Jelmer Vernooij wrote:
…
> > I'm glad this is using inventory-deltas.  But if it's worth zlib compressing
> > them here, shouldn't we do that for inventory-deltas from get_record_stream
> > too?
> It's a pretty big inventory delta in this case - for something like gcc it
> matters for the delta between null: and the initial revision that was
> requested. I haven't investigated whether it would help for record streams
> too, but I imagine it would have less of an effect.

Well, I know that there are already cases that send deltas from null: —
something to do with stacking perhaps?  So reusing this improvement globally
would be nice.

> > Also, could you implement this via the get_record_stream RPC with a new
> > parameter that means “just inventories (rather than rev+inv+texts+chks+sigs
> > for
> > given keys)”?
> > 
> > On one hand it might feel a bit ugly to make one RPC do so many things,
> > get_record_stream, do so many things, but on the other I think it provides a
> > useful pressure to keep the interface minimal and consistent (e.g. whether
> > records are zlib-compressed).
> > 
> > Adding this iter_inventories RPC might be the right thing (although putting
> > “iter” in the name of an RPC feels weird to me, that's surely a property of
> > the Python API rather than a property of the RPC), but I'd like hear what you
> > think about the tradeoffs of doing that vs. extending get_record_stream.
> 
> With regard to the name: Most of the other calls seem to be named after the
> equivalent method in the client / server, that's why I went with
> Repository.iter_inventories. I'm not particularly tied to that name, something
> like "Repository.get_inventories" or "Repository.stream_inventories" seems
> reasonable to me too.

I think of using the name of the Python API as a good guide, not a strict rule.
Certainly there's no insert_stream_locked method on Repository…

“Repository.get_inventories” sounds fine to me.

> I'm not sure whether this should be part of get_record_stream. I have a hard
> time understanding the get_record_stream code as it is, so I'd rather not make
> it even more complex by adding more parameters - for example,
> get_record_streams seem to be dependent on the repository on-disk format to an
> extent - the inventory stream is not, as it's using inventory deltas. In other
> words, adding another verb was simpler, while keeping it all understandable
> for a mere mortal like me. :-)
> 
> Either way, I agree we should be reusing more code between the work I've done
> recently and the existing record stream calls. But it seems to me that would
> best be done by refactoring so they e.g. use the same code for sending a
> stream of blobs of indeterminate length, rather than by all being a part of
> the same verb.

Well, the path to reuse we've developed *is* record streams.  I'm quite willing
to believe that get_record_stream's parameters aren't a convenient way to
express all needs.  But I'd really like it the thing that was returned by a new
verb was a true record stream — something you could pass directly to
insert_record_stream.  The idea is that record streams should be the One True
Way we have to say “here is a stream of data from a repository.”  The existing
get_record_stream stuff is pretty complex, but IIRC that's from two sources:

1. the complexity of constructing a definitely
    ready-to-commit-perhaps-after-a-single-fixup-fetch-for-stacking-invariants
    stream for the “give me all revisions/invs/texts/etc records for revisions
    in this subgraph of the revision graph”.  That's mostly independent of
    get_record_stream, it just happens to be where the logic for that is
    encoded.

2. the abstractions/indirections to allow various different repository formats
    to send their own native records directly via record streams.

I think 2 is a strong reason to keep using records streams, rather than
inventing something new and discovering that out of necessity you have ended up
reinventing it.

The current method of extending get_record_stream by shoehorning more and more
query types into the one kind of call perhaps isn't a good tradeoff.  I think at
some level it will boil down to the same work though, even if the veneer is
get_inventories(invs) rather than get_record_stream(JustInventories(invs), …).

« Back to merge proposal