Jelmer Vernooij wrote: … > > I'm glad this is using inventory-deltas. But if it's worth zlib compressing > > them here, shouldn't we do that for inventory-deltas from get_record_stream > > too? > It's a pretty big inventory delta in this case - for something like gcc it > matters for the delta between null: and the initial revision that was > requested. I haven't investigated whether it would help for record streams > too, but I imagine it would have less of an effect. Well, I know that there are already cases that send deltas from null: — something to do with stacking perhaps? So reusing this improvement globally would be nice. > > Also, could you implement this via the get_record_stream RPC with a new > > parameter that means “just inventories (rather than rev+inv+texts+chks+sigs > > for > > given keys)”? > > > > On one hand it might feel a bit ugly to make one RPC do so many things, > > get_record_stream, do so many things, but on the other I think it provides a > > useful pressure to keep the interface minimal and consistent (e.g. whether > > records are zlib-compressed). > > > > Adding this iter_inventories RPC might be the right thing (although putting > > “iter” in the name of an RPC feels weird to me, that's surely a property of > > the Python API rather than a property of the RPC), but I'd like hear what you > > think about the tradeoffs of doing that vs. extending get_record_stream. > > With regard to the name: Most of the other calls seem to be named after the > equivalent method in the client / server, that's why I went with > Repository.iter_inventories. I'm not particularly tied to that name, something > like "Repository.get_inventories" or "Repository.stream_inventories" seems > reasonable to me too. I think of using the name of the Python API as a good guide, not a strict rule. Certainly there's no insert_stream_locked method on Repository… “Repository.get_inventories” sounds fine to me. > I'm not sure whether this should be part of get_record_stream. I have a hard > time understanding the get_record_stream code as it is, so I'd rather not make > it even more complex by adding more parameters - for example, > get_record_streams seem to be dependent on the repository on-disk format to an > extent - the inventory stream is not, as it's using inventory deltas. In other > words, adding another verb was simpler, while keeping it all understandable > for a mere mortal like me. :-) > > Either way, I agree we should be reusing more code between the work I've done > recently and the existing record stream calls. But it seems to me that would > best be done by refactoring so they e.g. use the same code for sending a > stream of blobs of indeterminate length, rather than by all being a part of > the same verb. Well, the path to reuse we've developed *is* record streams. I'm quite willing to believe that get_record_stream's parameters aren't a convenient way to express all needs. But I'd really like it the thing that was returned by a new verb was a true record stream — something you could pass directly to insert_record_stream. The idea is that record streams should be the One True Way we have to say “here is a stream of data from a repository.” The existing get_record_stream stuff is pretty complex, but IIRC that's from two sources: 1. the complexity of constructing a definitely ready-to-commit-perhaps-after-a-single-fixup-fetch-for-stacking-invariants stream for the “give me all revisions/invs/texts/etc records for revisions in this subgraph of the revision graph”. That's mostly independent of get_record_stream, it just happens to be where the logic for that is encoded. 2. the abstractions/indirections to allow various different repository formats to send their own native records directly via record streams. I think 2 is a strong reason to keep using records streams, rather than inventing something new and discovering that out of necessity you have ended up reinventing it. The basic structure of a record stream itself is very simple and lightweight (although it may well benefit from being wrapped in zlib compression over the wire?), and should be reused as the default decision for repository data unless there's a really good reason not to. The current method of extending get_record_stream by shoehorning more and more query types into the one kind of call perhaps isn't a good tradeoff. I think at some level it will boil down to the same work though, even if the veneer is get_inventories(invs) rather than get_record_stream(JustInventories(invs), …).