Overall, this is looking pretty good. A few small tweaks here and there, and
some possible concerns, but only one major concern: you changed the default
return value of "iter_inventories()" to return 'unordered' results, which
means that "revision_trees()" also now returns 'unordered' results. That is
fairly serious, and I was able to find more than one code path that was
relying on the ordering of revision_trees(). So I think the idea is sound, but
it needs to be exposed in a backwards-compatible manner, by making it a flag
that defaults to "as-requested" ordering, and then we fix up code paths as we
can. I don't specifically need to review that change, though. And I don't
really want to review this patch yet again :). So I'm voting tweak, and you
can submit if you agree with my findings.

119	+        from bzrlib.graph import FrozenHeadsCache
120	+        graph = FrozenHeadsCache(graph)
121	+        new_roots_stream = _new_root_data_stream(
122	+            root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
123	+        return [('texts', new_roots_stream)]

^- I'm pretty sure that if we are using FrozenHeadsCache, we really want to
use KnownGraph instead. We have a dict 'parent_map', which means we can do
much more efficient heads() checks since the whole graph is already loaded.
This is a minor thing we can do later, but it would probably be good not to
forget.

...

392	-            elif last_modified[-1] == ':':
393	-                raise errors.BzrError('special revisionid found: %r' % line)
394	-            if not delta_tree_references and content.startswith('tree\x00'):
395	+            elif newpath_utf8 != 'None' and last_modified[-1] == ':':
396	+                # Deletes have a last_modified of null:, but otherwise special
397	+                # revision ids should not occur.
398	+                raise errors.BzrError('special revisionid found: %r' % line)
399	+            if delta_tree_references is False and content.startswith('tree\x00'):

^- What does "newpath_utf8 != 'None'" mean if someone does:

  touch None
  bzr add None
  bzr commit -m "adding None"

Is this a serialization bug waiting to be exposed? I guess not, as it seems
the paths are always prefixed with "/", right? (So a path of None would
actually be "newpath_utf8 == '/None'".)

However, further down (sorry about the bad indenting):

413	            if newpath_utf8 == 'None':
414	                newpath = None

^- here you set "newpath = None" but you *don't* set "newpath_utf8" to None.

415	+            elif newpath_utf8[:1] != '/':
416	+                raise errors.BzrError(
417	+                    "newpath invalid (does not start with /): %r"
418	+                    % (newpath_utf8,))
419	            else:
420	+                # Trim leading slash
421	+                newpath_utf8 = newpath_utf8[1:]
422	                newpath = newpath_utf8.decode('utf8')
423	+            content_tuple = tuple(content.split('\x00'))
424	+            if content_tuple[0] == 'deleted':
425	+                entry = None
426	+            else:
427	+                entry = _parse_entry(
428	+                    newpath_utf8, file_id, parent_id, last_modified,
429	+                    content_tuple)

And then here "newpath_utf8" is passed to _parse_entry. Now, I realize this is
probably caught by "content_tuple[0] == 'deleted'", though it feels a bit icky
to rely on "newpath_utf8" in one portion and "content_tuple[0]" in another
(since they can potentially be out of sync). I think we should just force an
early error with:

  if newpath_utf8 == 'None':
      newpath = None
      newpath_utf8 = None

Then if there is an issue, at least it fails rather than succeeding at
creating a new file with the name "None".

  if utf8_path.startswith('/'):

^- If this is a core routine (something called for every path) then:

  if utf8_path[:1] == '/':

*is* faster than .startswith() because you

1) Don't have a function call
2) Don't have an attribute lookup

I'm assuming this is a function that gets called a lot. If not, don't worry
about it.

...
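[To make the startswith-vs-slice point above concrete, here is a quick
stand-alone micro-benchmark sketch, not part of the patch. The two checks are
semantically equivalent (including on the empty string); the slicing form
skips the attribute lookup and method call, though the exact difference
depends on the Python build.]

```python
import timeit

path = 'some/deeply/nested/path'

t_startswith = timeit.timeit("path.startswith('/')",
                             globals={'path': path}, number=200000)
t_slice = timeit.timeit("path[:1] == '/'",
                        globals={'path': path}, number=200000)

print('startswith: %.4fs  slice: %.4fs' % (t_startswith, t_slice))

# Equivalence check, including the empty-string edge case:
for p in ('/abs/path', 'rel/path', ''):
    assert p.startswith('/') == (p[:1] == '/')
```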
566	+        if required_version < (1, 18):
567	+            # Remote side doesn't support inventory deltas. Wrap the stream to
568	+            # make sure we don't send any. If the stream contains inventory
569	+            # deltas we'll interrupt the smart insert_stream request and
570	+            # fallback to VFS.
571	+            stream = self._stop_stream_if_inventory_delta(stream)

^- It seems a bit of a shame that if we don't support deltas we fall back to
VFS completely, rather than trying something intermediate (like falling back
to the original code path of sending full inventory texts, or IDS, or ...). I
think we are probably okay, but this code at least raises a flag. I expect a
bug report along the lines of "fetching between 1.18 and an older server is
very slow". I haven't looked at all the code paths to determine whether 1.18
will have regressed against a 1.17 server, especially when *not* converting
formats. Have you at least manually tested this?

...

703	+        self.new_pack.finish_content()
704	+        if len(self.packs) == 1:
705	+            old_pack = self.packs[0]
706	+            if old_pack.name == self.new_pack._hash.hexdigest():
707	+                # The single old pack was already optimally packed.
708	+                self.new_pack.abort()
709	+                return None

^- I think this is a good fix, but:

1) It really should have a NEWS entry and a test case [it might have a test],
   as I'm sure there are people following the bug.
2) It probably should say something about the repository already being
   optimally packed. At a minimum a mutter(), but I'm thinking it is a
   reasonable 'note()'.
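[To illustrate what the hunk above is doing, here is a minimal stand-alone
sketch. It assumes, as the hunk suggests, that a pack's name is the hex digest
of its content hash (md5 used here for illustration); `note` is a stand-in for
bzrlib's trace function, showing where the suggested user-visible message
could go. None of these names are the actual bzrlib API.]

```python
import hashlib

def note(msg):
    # Stand-in for a trace/notification call; just prints.
    print(msg)

def repack_is_noop(old_pack_name, new_content):
    """Return True if repacking a single pack reproduced identical content.

    If the digest of the newly written content matches the old pack's
    name, the old pack was already optimally packed and the new pack
    can be aborted instead of replacing it.
    """
    new_name = hashlib.md5(new_content).hexdigest()
    if old_pack_name == new_name:
        note('repository already optimally packed; skipping repack')
        return True
    return False

content = b'pack-content-bytes'
assert repack_is_noop(hashlib.md5(content).hexdigest(), content)
assert not repack_is_noop('0' * 32, content)
```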
737	    def _get_source(self, to_format):
738	        """Return a source for streaming from this repository."""
739	-        if isinstance(to_format, remote.RemoteRepositoryFormat):
740	-            # Can't just check attributes on to_format with the current code,
741	-            # work around this:
742	-            to_format._ensure_real()
743	-            to_format = to_format._custom_format
744	-        if to_format.__class__ is self._format.__class__:
745	+        if (to_format.supports_chks and
746	+            self._format.repository_class is to_format.repository_class and
747	+            self._format._serializer == to_format._serializer):
748	            # We must be exactly the same format, otherwise stuff like the chk
749	-            # page layout might be different
750	+            # page layout might be different.
751	+            # Actually, this test is just slightly looser than exact so that
752	+            # CHK2 <-> 2a transfers will work.

^- I think this should probably be ".network_name() == other.network_name()"
and we just customize the names to be the same. Is that possible to do?

854	+    def iter_inventories(self, revision_ids, ordering='unordered'):

^- The "iter_inventories" API was written such that it always returned the
inventories in the *same* order as the supplied revisions, and we've had code
that did:

  for rev, inv in izip(revs, iter_inventories(revs)):
      ...

I've tried to get away from it, because of inv.revision_id, but I think this
is a significant API break that should be documented. That is the original
reason *why* _iter_inventory_xmls() always buffered the texts before yielding.

There are a few places that you didn't fix, such as:

bzrlib/bundle/serializer/v4.py:369:
  for inv in self.repository.iter_inventories(needed_inventories)

^- This, specifically, is one of the places I was working on with my bundles +
2a fixes, and the supplied ordering is *required* (though it happens to also
be topological). I'm concerned that we may have more layers that you don't
realize are built on top of this.
For example:

  def revision_trees(self, revision_ids):
      """Return Trees for revisions in this repository.

      :param revision_ids: a sequence of revision-ids;
        a revision-id may not be None or 'null:'
      """
      inventories = self.iter_inventories(revision_ids)
      for inv in inventories:
          yield RevisionTree(self, inv, inv.revision_id)

^- And "revision_trees" is *also* explicitly stated to return the trees in the
order requested. And you get stuff like:

bzrlib/repository.py:580:
  revtrees = list(self.repository.revision_trees(self.parents))

^- This is a rather serious one in "record_iter_changes", as it really
requires the ordering. We could layer the ordering on top of it, etc. However,
I really think that for *your* patch we shouldn't change the return ordering
of "iter_inventories". We could change it as:

  def iter_inventories(self, revisions, ordering=None):
      if ordering is None or ordering == 'as-requested':
          # return inventories in the exact order as requested
          ...

And then for new code that we know we can trust to handle results in arbitrary
order (or to make an explicit request on the order), we can do better. It also
gives us a way to transition, and to mark code that has been fixed. The
parameter could be extended to be passed from "revision_trees()".

1185	+        elif from_format.network_name() == self.to_format.network_name():
1186	+            # Same format.
1187	+            return self._get_simple_inventory_stream(revision_ids,
1188	+                missing=missing)
1189	+        elif (not from_format.supports_chks and not self.to_format.supports_chks
1190	+            and from_format._serializer == self.to_format._serializer):
1191	+            # Essentially the same format.
1192	+            return self._get_simple_inventory_stream(revision_ids,
1193	+                missing=missing)

^- Would the second "elif" be better as just an "or ..." on the first elif?

...

1434	+    def _should_fake_unknown(self):
1435	+        # This is a workaround for bugs in pre-1.18 clients that claim to

^- This seems interesting; is it well tested that this is triggered?
(And did you manually use an older bzr and test it with a newer server?)

1660	-        result.append((InterRepository,
1661	-                       weaverepo.RepositoryFormat5(),
1662	-                       knitrepo.RepositoryFormatKnit3()))
1663	-        result.append((InterRepository,
1664	-                       knitrepo.RepositoryFormatKnit1(),
1665	-                       knitrepo.RepositoryFormatKnit3()))
1666	-        result.append((InterRepository,
1667	-                       knitrepo.RepositoryFormatKnit1(),
1668	-                       knitrepo.RepositoryFormatKnit3()))
1669	-        result.append((InterKnitRepo,
1670	-                       knitrepo.RepositoryFormatKnit1(),
1671	-                       pack_repo.RepositoryFormatKnitPack1()))
1672	-        result.append((InterKnitRepo,
1673	-                       pack_repo.RepositoryFormatKnitPack1(),
1674	-                       knitrepo.RepositoryFormatKnit1()))
1675	-        result.append((InterKnitRepo,
1676	-                       knitrepo.RepositoryFormatKnit3(),
1677	-                       pack_repo.RepositoryFormatKnitPack3()))
1678	-        result.append((InterKnitRepo,
1679	-                       pack_repo.RepositoryFormatKnitPack3(),
1680	-                       knitrepo.RepositoryFormatKnit3()))
1681	-        result.append((InterKnitRepo,
1682	-                       pack_repo.RepositoryFormatKnitPack3(),
1683	-                       pack_repo.RepositoryFormatKnitPack4()))
1684	-        result.append((InterDifferingSerializer,
1685	-                       pack_repo.RepositoryFormatKnitPack1(),
1686	-                       pack_repo.RepositoryFormatKnitPack6RichRoot()))
1687	+        add_combo(weaverepo.RepositoryFormat5(),
1688	+                  knitrepo.RepositoryFormatKnit3())
1689	+        add_combo(knitrepo.RepositoryFormatKnit1(),
1690	+                  knitrepo.RepositoryFormatKnit3())
1691	+        add_combo(knitrepo.RepositoryFormatKnit1(),
1692	+                  pack_repo.RepositoryFormatKnitPack1())

^- So aside from some obvious duplication that needed to be removed, this
looks like it removes some test permutations, namely testing that both
"InterRepository" *and* "InterKnitRepo" work between certain repository
formats. I'm a little concerned that we are opening potential holes here. What
do you think?

I can't say that I did a fully thorough review of all the tests. My primary
concern is the change to "iter_inventories", which breaks a bunch of
assumptions from higher-up code.
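[Returning to that primary concern: a minimal stand-alone sketch of the
transition path proposed above, i.e. layering 'as-requested' ordering over a
producer that yields in arbitrary order. All names here are illustrative
stand-ins, not bzrlib's actual API; the point is that buffering results keyed
by revision id (as _iter_inventory_xmls effectively did) preserves the old
ordering contract while still allowing an explicit 'unordered' opt-in.]

```python
def iter_inventories(revision_ids, get_unordered, ordering='as-requested'):
    """Yield an inventory object for each id in revision_ids.

    get_unordered(revision_ids) may yield (revision_id, inventory) pairs
    in any order; with ordering='as-requested' we buffer and re-emit them
    in the order the caller asked for.
    """
    if ordering == 'unordered':
        for _, inv in get_unordered(revision_ids):
            yield inv
    elif ordering == 'as-requested':
        buffered = {}
        wanted = iter(revision_ids)
        next_id = next(wanted, None)
        for rev_id, inv in get_unordered(revision_ids):
            buffered[rev_id] = inv
            # Flush everything that can now be emitted in request order.
            while next_id is not None and next_id in buffered:
                yield buffered.pop(next_id)
                next_id = next(wanted, None)
    else:
        raise ValueError('unknown ordering: %r' % (ordering,))

# Example: a producer that happens to yield in reversed order.
def producer(ids):
    for rev_id in reversed(list(ids)):
        yield rev_id, 'inv-for-%s' % rev_id

result = list(iter_inventories(['a', 'b', 'c'], producer))
# result == ['inv-for-a', 'inv-for-b', 'inv-for-c']
```

A nice property of flushing incrementally rather than buffering everything
up front is that callers like the izip pattern quoted earlier keep working
unchanged, while memory use only grows as far as the producer's reordering
actually requires.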