Am 05/12/11 11:31, schrieb Andrew Bennetts: > Jelmer Vernooij wrote: > … >>> I'm glad this is using inventory-deltas. But if it's worth zlib compressing >>> them here, shouldn't we do that for inventory-deltas from get_record_stream >>> too? >> It's a pretty big inventory delta in this case - for something like gcc it >> matters for the delta between null: and the initial revision that was >> requested. I haven't investigated whether it would help for record streams >> too, but I imagine it would have less of an effect. > Well, I know that there are already cases that send deltas from null: — > something to do with stacking perhaps? So reusing this improvement globally > would be nice. I guess the best way to do this would be to add a zlib pack record kind for network streams? I.e. a new entry in NetworkRecordStream._kind_factory, and something equivalent on the server side? >>> Also, could you implement this via the get_record_stream RPC with a new >>> parameter that means “just inventories (rather than rev+inv+texts+chks+sigs >>> for >>> given keys)”? >>> >>> On one hand it might feel a bit ugly to make one RPC do so many things, >>> get_record_stream, do so many things, but on the other I think it provides a >>> useful pressure to keep the interface minimal and consistent (e.g. whether >>> records are zlib-compressed). >>> >>> Adding this iter_inventories RPC might be the right thing (although putting >>> “iter” in the name of an RPC feels weird to me, that's surely a property of >>> the Python API rather than a property of the RPC), but I'd like hear what you >>> think about the tradeoffs of doing that vs. extending get_record_stream. >> With regard to the name: Most of the other calls seem to be named after the >> equivalent method in the client / server, that's why I went with >> Repository.iter_inventories. I'm not particularly tied to that name, something >> like "Repository.get_inventories" or "Repository.stream_inventories" seems >> reasonable to me too. > I think of using the name of the Python API as a good guide, not a strict rule. > Certainly there's no insert_stream_locked method on Repository… > > “Repository.get_inventories” sounds fine to me. I've renamed it. > >> I'm not sure whether this should be part of get_record_stream. I have a hard >> time understanding the get_record_stream code as it is, so I'd rather not make >> it even more complex by adding more parameters - for example, >> get_record_streams seem to be dependent on the repository on-disk format to an >> extent - the inventory stream is not, as it's using inventory deltas. In other >> words, adding another verb was simpler, while keeping it all understandable >> for a mere mortal like me. :-) >> >> Either way, I agree we should be reusing more code between the work I've done >> recently and the existing record stream calls. But it seems to me that would >> best be done by refactoring so they e.g. use the same code for sending a >> stream of blobs of indeterminate length, rather than by all being a part of >> the same verb. > Well, the path to reuse we've developed *is* record streams. I'm quite willing > to believe that get_record_stream's parameters aren't a convenient way to > express all needs. But I'd really like it the thing that was returned by a new > verb was a true record stream — something you could pass directly to > insert_record_stream. The idea is that record streams should be the One True > Way we have to say “here is a stream of data from a repository.” The existing > get_record_stream stuff is pretty complex, but IIRC that's from two sources: > > 1. the complexity of constructing a definitely > ready-to-commit-perhaps-after-a-single-fixup-fetch-for-stacking-invariants > stream for the “give me all revisions/invs/texts/etc records for revisions > in this subgraph of the revision graph”. That's mostly independent of > get_record_stream, it just happens to be where the logic for that is > encoded. > > 2. the abstractions/indirections to allow various different repository formats > to send their own native records directly via record streams. > > I think 2 is a strong reason to keep using records streams, rather than > inventing something new and discovering that out of necessity you have ended up > reinventing it. > > The basic structure of a record stream itself is very simple and lightweight > (although it may well benefit from being wrapped in zlib compression over the > wire?), and should be reused as the default decision for repository data unless > there's a really good reason not to. > > The current method of extending get_record_stream by shoehorning more and more > query types into the one kind of call perhaps isn't a good tradeoff. I think at > some level it will boil down to the same work though, even if the veneer is > get_inventories(invs) rather than get_record_stream(JustInventories(invs), …). I've done some more digging in the get_record_stream verb code today. Using record streams seems reasonable to me, but I would like to use a separate verb (rather than Repository.get_stream_1.19 and friends) to keep the interface simple - there's so much going on in get_record_stream that we don't need here. It seems like there are two things we can do - either just send a stream similar to get_record_stream_1.19 with a single inventory-delta "substream", or we could just send the substream with inventory deltas. The latter is simpler and would need slightly less bandwidth, but can't be used with e.g. insert_record_stream. I'm not sure if we care about that, though. What do you think? The body_stream method of the GetInventories server side implementation would look something like this if we were directly using substreams: def body_stream(self, repository, ordering, revids): pack_writer = pack.ContainerSerialiser() yield pack_writer.begin() serializer = inventory_delta.InventoryDeltaSerializer( repository.supports_rich_root(), repository._format.supports_tree_reference) prev_inv = _mod_inventory.Inventory(root_id=None, revision_id=_mod_revision.NULL_REVISION) self._repository.lock_read() try: for inv, revid in repository._iter_inventories(revids, ordering): if inv is None: continue inv_delta = inv._make_delta(prev_inv) lines = serializer.delta_to_lines( prev_inv.revision_id, inv.revision_id, inv_delta) yield pack_writer.bytes_record("".join(lines), [('inventory-delta',)]) prev_inv = inv finally: self._repository.unlock() yield pack_writer.end() Thanks again for reviewing this - I really appreciate your thoughts. Cheers, Jelmer