Merge lp:~jameinel/bzr/1.15-pack-source into lp:~bzr/bzr/trunk-old

Proposed by John A Meinel on 2009-06-02
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/1.15-pack-source
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 824 lines
To merge this branch: bzr merge lp:~jameinel/bzr/1.15-pack-source
Reviewer Review Type Date Requested Status
Martin Pool 2009-06-02 Approve on 2009-06-16
Review via email: mp+6985@code.launchpad.net
John A Meinel (jameinel) wrote :

This proposal changes how pack <=> pack fetching is triggered.

It removes the InterPackRepo optimizer (which uses Packer internally) in favor of a new KnitPackStreamSource.

The new source is a pared-down version of StreamSource that doesn't attempt to handle all the different cross-format issues. It only supports exact-format fetching, and does so in a nicely streamlined fashion.

Specifically, it sends data as (signatures, revisions, inventories, texts) since it knows we have atomic insertion.

It walks the inventory pages a single time, extracting the text keys as the fetch progresses, rather than in a separate pass before the fetch. This is a moderate win for dumb-transport fetching (versus StreamSource, but not InterPackRepo) because it avoids reading the inventory pages twice.
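The single-pass idea can be sketched outside bzrlib as a generator that records text keys as a side effect while the inventory records stream past (an illustrative sketch only; the record shape and names below are hypothetical, not the real bzrlib API):

```python
def filtered_inv_stream(inventory_records, text_keys_out):
    """Yield inventory records unchanged, collecting the text keys they
    reference into text_keys_out as a side effect of the single walk."""
    for record in inventory_records:
        text_keys_out.update(record["text_keys"])
        yield record

# Hypothetical inventory records: each references some (file_id, rev_id) keys.
records = [
    {"inv": "rev-1", "text_keys": {("file-a", "rev-1")}},
    {"inv": "rev-2", "text_keys": {("file-b", "rev-2"), ("file-a", "rev-1")}},
]
collected = set()
# One walk over the inventories: records are forwarded and keys gathered,
# so no second read of the inventory pages is needed.
sent = list(filtered_inv_stream(records, collected))
```

The point is that the consumer of the inventory stream and the text-key extraction share the same pass, which is what makes the dumb-transport case cheaper.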

It also fixes a bug in the current InterPackRepo code. Namely, the Packer code was recently changed to make sure that all referenced file_keys are fetched, rather than only the ones mentioned in the specific revisions being fetched. This was done at about the same time as the updates to file_ids_altered_by... However, that update did not also read the parent inventories and remove their text keys.

This meant that if you got a fulltext inventory, you would end up copying the data for all texts in that revision, whether they were modified or not. For bzr.dev, this meant that it often downloaded ~3MB of extra data for a small change. I considered fixing Packer to handle this, but I figured we wanted to move to StreamSource as the one-and-only method for fetching anyway.
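The fix amounts to a set difference: only text keys referenced by the fetched revisions' inventories, but not already referenced by their parent inventories, should be copied. A minimal sketch with made-up keys (not the Packer code itself):

```python
# Text keys referenced by the inventories being fetched. A fulltext
# inventory mentions every file in the tree, not just the modified ones.
content_text_keys = {
    ("file-a", "rev-1"),
    ("file-b", "rev-1"),
    ("file-b", "rev-2"),
}
# Text keys already referenced by the parent inventories.
parent_text_keys = {("file-a", "rev-1"), ("file-b", "rev-1")}
# Only the genuinely new texts need to be transmitted.
text_keys_to_fetch = content_text_keys - parent_text_keys
```

Skipping the subtraction is exactly the failure described above: every text mentioned by a fulltext inventory gets copied, modified or not.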

I also made some changes to make it clearer when a set of something holds *keys* (tuples) and when it holds *ids* (strings).
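For reference, the convention is that an *id* is a plain string while a *key* is a tuple (a 1-tuple for revisions; a (file_id, revision_id) pair for texts), and the diff converts between them like this:

```python
revision_id = "rev-1"           # an id: a plain string
revision_key = (revision_id,)   # a key: a 1-tuple
text_key = ("file-a", "rev-1")  # text keys are (file_id, revision_id) pairs

# The two conversions used throughout the patch:
revision_keys = [(r,) for r in ["rev-1", "rev-2"]]  # ids -> keys
revision_ids = [k[-1] for k in revision_keys]       # keys -> ids
```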

I also moved some of the helpers that were added as part of the gc-stacking patch into the base Repository class, so that I could simply re-use them.
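One of those helpers finds the parents sitting at the edge of the fetched ancestry. Its logic is essentially the following standalone sketch of the `_find_parent_keys_of_revisions` behaviour shown in the diff, with a plain dict standing in for the repository's parent map:

```python
NULL_REVISION_KEY = ("null:",)

def find_parent_keys_of_revisions(parent_map, revision_keys):
    """Return parents of revision_keys that are not themselves in
    revision_keys, discarding the null revision."""
    parent_keys = set()
    for parents in parent_map.values():
        parent_keys.update(parents)
    parent_keys.difference_update(revision_keys)
    parent_keys.discard(NULL_REVISION_KEY)
    return parent_keys

# 'b' has parent 'a'; 'c' has parents 'b' and the null revision.
parent_map = {("b",): [("a",)], ("c",): [("b",), ("null:",)]}
result = find_parent_keys_of_revisions(parent_map, {("b",), ("c",)})
```

Only ("a",) survives: ("b",) is being fetched itself and the null revision is never a real parent.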

Martin Pool (mbp) wrote :

This looks ok to me, though you might want to run the concept past Robert.

review: Approve
Robert Collins (lifeless) wrote :

On Tue, 2009-06-16 at 05:33 +0000, Martin Pool wrote:
> Review: Approve
> This looks ok to me, though you might want to run the concept past Robert.

Conceptually fine. Using Packer was a hack when we had no interface able
to be efficient back in the days of single VersionedFile and Knits.

-Rob

lp:~jameinel/bzr/1.15-pack-source updated on 2009-06-18
4374. By John A Meinel on 2009-06-17

Merge bzr.dev 4454 in preparation for NEWS entry.

4375. By John A Meinel on 2009-06-17

NEWS entry about PackStreamSource

4376. By John A Meinel on 2009-06-17

It seems that fetch() no longer returns the number of revisions fetched.
It still does for *some* InterRepository fetch paths, but the generic one does not.
It is also not easy to make it do so, since the Source and Sink are the ones
that would know how many keys were transmitted, and they are potentially 'remote'
objects.

This was also only tested as a by-product of a random 'test_commit' test.
I assume that if we really wanted the assurance, we would have a per_repo or
interrepo test for it.

4377. By John A Meinel on 2009-06-18

Change insert_from_broken_repo into an expectedFailure.
This has to do with bug #389141.

Preview Diff

=== modified file 'bzrlib/fetch.py'
--- bzrlib/fetch.py 2009-06-10 03:56:49 +0000
+++ bzrlib/fetch.py 2009-06-16 02:36:36 +0000
@@ -51,9 +51,6 @@
         :param last_revision: If set, try to limit to the data this revision
             references.
         :param find_ghosts: If True search the entire history for ghosts.
-        :param _write_group_acquired_callable: Don't use; this parameter only
-            exists to facilitate a hack done in InterPackRepo.fetch. We would
-            like to remove this parameter.
         :param pb: ProgressBar object to use; deprecated and ignored.
             This method will just create one on top of the stack.
         """
 
=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-06-12 01:11:00 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-06-16 02:36:36 +0000
@@ -48,6 +48,7 @@
     Pack,
     NewPack,
     KnitPackRepository,
+    KnitPackStreamSource,
     PackRootCommitBuilder,
     RepositoryPackCollection,
     RepositoryFormatPack,
@@ -736,21 +737,10 @@
         # make it raise to trap naughty direct users.
         raise NotImplementedError(self._iter_inventory_xmls)
 
-    def _find_parent_ids_of_revisions(self, revision_ids):
-        # TODO: we probably want to make this a helper that other code can get
-        # at
-        parent_map = self.get_parent_map(revision_ids)
-        parents = set()
-        map(parents.update, parent_map.itervalues())
-        parents.difference_update(revision_ids)
-        parents.discard(_mod_revision.NULL_REVISION)
-        return parents
-
-    def _find_present_inventory_ids(self, revision_ids):
-        keys = [(r,) for r in revision_ids]
-        parent_map = self.inventories.get_parent_map(keys)
-        present_inventory_ids = set(k[-1] for k in parent_map)
-        return present_inventory_ids
+    def _find_present_inventory_keys(self, revision_keys):
+        parent_map = self.inventories.get_parent_map(revision_keys)
+        present_inventory_keys = set(k for k in parent_map)
+        return present_inventory_keys
 
     def fileids_altered_by_revision_ids(self, revision_ids, _inv_weave=None):
         """Find the file ids and versions affected by revisions.
@@ -767,12 +757,20 @@
         file_id_revisions = {}
         pb = ui.ui_factory.nested_progress_bar()
         try:
-            parent_ids = self._find_parent_ids_of_revisions(revision_ids)
-            present_parent_inv_ids = self._find_present_inventory_ids(parent_ids)
+            revision_keys = [(r,) for r in revision_ids]
+            parent_keys = self._find_parent_keys_of_revisions(revision_keys)
+            # TODO: instead of using _find_present_inventory_keys, change the
+            #       code paths to allow missing inventories to be tolerated.
+            #       However, we only want to tolerate missing parent
+            #       inventories, not missing inventories for revision_ids
+            present_parent_inv_keys = self._find_present_inventory_keys(
+                parent_keys)
+            present_parent_inv_ids = set(
+                [k[-1] for k in present_parent_inv_keys])
             uninteresting_root_keys = set()
             interesting_root_keys = set()
-            inventories_to_read = set(present_parent_inv_ids)
-            inventories_to_read.update(revision_ids)
+            inventories_to_read = set(revision_ids)
+            inventories_to_read.update(present_parent_inv_ids)
             for inv in self.iter_inventories(inventories_to_read):
                 entry_chk_root_key = inv.id_to_entry.key()
                 if inv.revision_id in present_parent_inv_ids:
@@ -846,7 +844,7 @@
         return super(CHKInventoryRepository, self)._get_source(to_format)
 
 
-class GroupCHKStreamSource(repository.StreamSource):
+class GroupCHKStreamSource(KnitPackStreamSource):
     """Used when both the source and target repo are GroupCHK repos."""
 
     def __init__(self, from_repository, to_format):
@@ -854,6 +852,7 @@
         super(GroupCHKStreamSource, self).__init__(from_repository, to_format)
         self._revision_keys = None
         self._text_keys = None
+        self._text_fetch_order = 'groupcompress'
         self._chk_id_roots = None
         self._chk_p_id_roots = None
 
@@ -898,16 +897,10 @@
             p_id_roots_set.clear()
         return ('inventories', _filtered_inv_stream())
 
-    def _find_present_inventories(self, revision_ids):
-        revision_keys = [(r,) for r in revision_ids]
-        inventories = self.from_repository.inventories
-        present_inventories = inventories.get_parent_map(revision_keys)
-        return [p[-1] for p in present_inventories]
-
-    def _get_filtered_chk_streams(self, excluded_revision_ids):
+    def _get_filtered_chk_streams(self, excluded_revision_keys):
         self._text_keys = set()
-        excluded_revision_ids.discard(_mod_revision.NULL_REVISION)
-        if not excluded_revision_ids:
+        excluded_revision_keys.discard(_mod_revision.NULL_REVISION)
+        if not excluded_revision_keys:
             uninteresting_root_keys = set()
             uninteresting_pid_root_keys = set()
         else:
@@ -915,9 +908,9 @@
             # actually present
             # TODO: Update Repository.iter_inventories() to add
             #       ignore_missing=True
-            present_ids = self.from_repository._find_present_inventory_ids(
-                excluded_revision_ids)
-            present_ids = self._find_present_inventories(excluded_revision_ids)
+            present_keys = self.from_repository._find_present_inventory_keys(
+                excluded_revision_keys)
+            present_ids = [k[-1] for k in present_keys]
             uninteresting_root_keys = set()
             uninteresting_pid_root_keys = set()
             for inv in self.from_repository.iter_inventories(present_ids):
@@ -948,14 +941,6 @@
             self._chk_p_id_roots = None
         yield 'chk_bytes', _get_parent_id_basename_to_file_id_pages()
 
-    def _get_text_stream(self):
-        # Note: We know we don't have to handle adding root keys, because both
-        # the source and target are GCCHK, and those always support rich-roots
-        # We may want to request as 'unordered', in case the source has done a
-        # 'split' packing
-        return ('texts', self.from_repository.texts.get_record_stream(
-            self._text_keys, 'groupcompress', False))
-
     def get_stream(self, search):
         revision_ids = search.get_keys()
         for stream_info in self._fetch_revision_texts(revision_ids):
@@ -966,8 +951,9 @@
         # For now, exclude all parents that are at the edge of ancestry, for
         # which we have inventories
         from_repo = self.from_repository
-        parent_ids = from_repo._find_parent_ids_of_revisions(revision_ids)
-        for stream_info in self._get_filtered_chk_streams(parent_ids):
+        parent_keys = from_repo._find_parent_keys_of_revisions(
+            self._revision_keys)
+        for stream_info in self._get_filtered_chk_streams(parent_keys):
             yield stream_info
         yield self._get_text_stream()
 
@@ -991,8 +977,8 @@
         # no unavailable texts when the ghost inventories are not filled in.
         yield self._get_inventory_stream(missing_inventory_keys,
             allow_absent=True)
-        # We use the empty set for excluded_revision_ids, to make it clear that
-        # we want to transmit all referenced chk pages.
+        # We use the empty set for excluded_revision_keys, to make it clear
+        # that we want to transmit all referenced chk pages.
         for stream_info in self._get_filtered_chk_streams(set()):
             yield stream_info
 
=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py 2009-06-10 03:56:49 +0000
+++ bzrlib/repofmt/pack_repo.py 2009-06-16 02:36:36 +0000
@@ -73,6 +73,7 @@
     MetaDirRepositoryFormat,
     RepositoryFormat,
     RootCommitBuilder,
+    StreamSource,
     )
 import bzrlib.revision as _mod_revision
 from bzrlib.trace import (
@@ -2265,6 +2266,11 @@
             pb.finished()
         return result
 
+    def _get_source(self, to_format):
+        if to_format.network_name() == self._format.network_name():
+            return KnitPackStreamSource(self, to_format)
+        return super(KnitPackRepository, self)._get_source(to_format)
+
     def _make_parents_provider(self):
         return graph.CachingParentsProvider(self)
 
@@ -2384,6 +2390,79 @@
             repo.unlock()
 
 
+class KnitPackStreamSource(StreamSource):
+    """A StreamSource used to transfer data between same-format KnitPack repos.
+
+    This source assumes:
+        1) Same serialization format for all objects
+        2) Same root information
+        3) XML format inventories
+        4) Atomic inserts (so we can stream inventory texts before text
+           content)
+        5) No chk_bytes
+    """
+
+    def __init__(self, from_repository, to_format):
+        super(KnitPackStreamSource, self).__init__(from_repository, to_format)
+        self._text_keys = None
+        self._text_fetch_order = 'unordered'
+
+    def _get_filtered_inv_stream(self, revision_ids):
+        from_repo = self.from_repository
+        parent_ids = from_repo._find_parent_ids_of_revisions(revision_ids)
+        parent_keys = [(p,) for p in parent_ids]
+        find_text_keys = from_repo._find_text_key_references_from_xml_inventory_lines
+        parent_text_keys = set(find_text_keys(
+            from_repo._inventory_xml_lines_for_keys(parent_keys)))
+        content_text_keys = set()
+        knit = KnitVersionedFiles(None, None)
+        factory = KnitPlainFactory()
+        def find_text_keys_from_content(record):
+            if record.storage_kind not in ('knit-delta-gz', 'knit-ft-gz'):
+                raise ValueError("Unknown content storage kind for"
+                    " inventory text: %s" % (record.storage_kind,))
+            # It's a knit record, it has a _raw_record field (even if it was
+            # reconstituted from a network stream).
+            raw_data = record._raw_record
+            # read the entire thing
+            revision_id = record.key[-1]
+            content, _ = knit._parse_record(revision_id, raw_data)
+            if record.storage_kind == 'knit-delta-gz':
+                line_iterator = factory.get_linedelta_content(content)
+            elif record.storage_kind == 'knit-ft-gz':
+                line_iterator = factory.get_fulltext_content(content)
+            content_text_keys.update(find_text_keys(
+                [(line, revision_id) for line in line_iterator]))
+        revision_keys = [(r,) for r in revision_ids]
+        def _filtered_inv_stream():
+            source_vf = from_repo.inventories
+            stream = source_vf.get_record_stream(revision_keys,
+                                                 'unordered', False)
+            for record in stream:
+                if record.storage_kind == 'absent':
+                    raise errors.NoSuchRevision(from_repo, record.key)
+                find_text_keys_from_content(record)
+                yield record
+            self._text_keys = content_text_keys - parent_text_keys
+        return ('inventories', _filtered_inv_stream())
+
+    def _get_text_stream(self):
+        # Note: We know we don't have to handle adding root keys, because both
+        # the source and target are the identical network name.
+        text_stream = self.from_repository.texts.get_record_stream(
+            self._text_keys, self._text_fetch_order, False)
+        return ('texts', text_stream)
+
+    def get_stream(self, search):
+        revision_ids = search.get_keys()
+        for stream_info in self._fetch_revision_texts(revision_ids):
+            yield stream_info
+        self._revision_keys = [(rev_id,) for rev_id in revision_ids]
+        yield self._get_filtered_inv_stream(revision_ids)
+        yield self._get_text_stream()
+
+
 class RepositoryFormatPack(MetaDirRepositoryFormat):
     """Format logic for pack structured repositories.
 
=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py 2009-06-12 01:11:00 +0000
+++ bzrlib/repository.py 2009-06-16 02:36:36 +0000
@@ -1919,29 +1919,25 @@
             yield line, revid
 
     def _find_file_ids_from_xml_inventory_lines(self, line_iterator,
-        revision_ids):
+        revision_keys):
         """Helper routine for fileids_altered_by_revision_ids.
 
         This performs the translation of xml lines to revision ids.
 
         :param line_iterator: An iterator of lines, origin_version_id
-        :param revision_ids: The revision ids to filter for. This should be a
+        :param revision_keys: The revision ids to filter for. This should be a
             set or other type which supports efficient __contains__ lookups, as
-            the revision id from each parsed line will be looked up in the
-            revision_ids filter.
+            the revision key from each parsed line will be looked up in the
+            revision_keys filter.
         :return: a dictionary mapping altered file-ids to an iterable of
             revision_ids. Each altered file-ids has the exact revision_ids that
             altered it listed explicitly.
         """
         seen = set(self._find_text_key_references_from_xml_inventory_lines(
             line_iterator).iterkeys())
-        # Note that revision_ids are revision keys.
-        parent_maps = self.revisions.get_parent_map(revision_ids)
-        parents = set()
-        map(parents.update, parent_maps.itervalues())
-        parents.difference_update(revision_ids)
+        parent_keys = self._find_parent_keys_of_revisions(revision_keys)
         parent_seen = set(self._find_text_key_references_from_xml_inventory_lines(
-            self._inventory_xml_lines_for_keys(parents)))
+            self._inventory_xml_lines_for_keys(parent_keys)))
         new_keys = seen - parent_seen
         result = {}
         setdefault = result.setdefault
@@ -1949,6 +1945,33 @@
             setdefault(key[0], set()).add(key[-1])
         return result
 
+    def _find_parent_ids_of_revisions(self, revision_ids):
+        """Find all parent ids that are mentioned in the revision graph.
+
+        :return: set of revisions that are parents of revision_ids which are
+            not part of revision_ids themselves
+        """
+        parent_map = self.get_parent_map(revision_ids)
+        parent_ids = set()
+        map(parent_ids.update, parent_map.itervalues())
+        parent_ids.difference_update(revision_ids)
+        parent_ids.discard(_mod_revision.NULL_REVISION)
+        return parent_ids
+
+    def _find_parent_keys_of_revisions(self, revision_keys):
+        """Similar to _find_parent_ids_of_revisions, but used with keys.
+
+        :param revision_keys: An iterable of revision_keys.
+        :return: The parents of all revision_keys that are not already in
+            revision_keys
+        """
+        parent_map = self.revisions.get_parent_map(revision_keys)
+        parent_keys = set()
+        map(parent_keys.update, parent_map.itervalues())
+        parent_keys.difference_update(revision_keys)
+        parent_keys.discard(_mod_revision.NULL_REVISION)
+        return parent_keys
+
     def fileids_altered_by_revision_ids(self, revision_ids, _inv_weave=None):
         """Find the file ids and versions affected by revisions.
 
@@ -3418,144 +3441,6 @@
         return self.source.revision_ids_to_search_result(result_set)
 
 
-class InterPackRepo(InterSameDataRepository):
-    """Optimised code paths between Pack based repositories."""
-
-    @classmethod
-    def _get_repo_format_to_test(self):
-        from bzrlib.repofmt import pack_repo
-        return pack_repo.RepositoryFormatKnitPack6RichRoot()
-
-    @staticmethod
-    def is_compatible(source, target):
-        """Be compatible with known Pack formats.
-
-        We don't test for the stores being of specific types because that
-        could lead to confusing results, and there is no need to be
-        overly general.
-
-        InterPackRepo does not support CHK based repositories.
-        """
-        from bzrlib.repofmt.pack_repo import RepositoryFormatPack
-        from bzrlib.repofmt.groupcompress_repo import RepositoryFormatCHK1
-        try:
-            are_packs = (isinstance(source._format, RepositoryFormatPack) and
-                isinstance(target._format, RepositoryFormatPack))
-            not_packs = (isinstance(source._format, RepositoryFormatCHK1) or
-                isinstance(target._format, RepositoryFormatCHK1))
-        except AttributeError:
-            return False
-        if not_packs or not are_packs:
-            return False
-        return InterRepository._same_model(source, target)
-
-    @needs_write_lock
-    def fetch(self, revision_id=None, pb=None, find_ghosts=False,
-            fetch_spec=None):
-        """See InterRepository.fetch()."""
-        if (len(self.source._fallback_repositories) > 0 or
-            len(self.target._fallback_repositories) > 0):
-            # The pack layer is not aware of fallback repositories, so when
-            # fetching from a stacked repository or into a stacked repository
-            # we use the generic fetch logic which uses the VersionedFiles
-            # attributes on repository.
-            from bzrlib.fetch import RepoFetcher
-            fetcher = RepoFetcher(self.target, self.source, revision_id,
-                                  pb, find_ghosts, fetch_spec=fetch_spec)
-        if fetch_spec is not None:
-            if len(list(fetch_spec.heads)) != 1:
-                raise AssertionError(
-                    "InterPackRepo.fetch doesn't support "
-                    "fetching multiple heads yet.")
-            revision_id = list(fetch_spec.heads)[0]
-            fetch_spec = None
-        if revision_id is None:
-            # TODO:
-            # everything to do - use pack logic
-            # to fetch from all packs to one without
-            # inventory parsing etc, IFF nothing to be copied is in the target.
-            # till then:
-            source_revision_ids = frozenset(self.source.all_revision_ids())
-            revision_ids = source_revision_ids - \
-                frozenset(self.target.get_parent_map(source_revision_ids))
-            revision_keys = [(revid,) for revid in revision_ids]
-            index = self.target._pack_collection.revision_index.combined_index
-            present_revision_ids = set(item[1][0] for item in
-                index.iter_entries(revision_keys))
-            revision_ids = set(revision_ids) - present_revision_ids
-            # implementing the TODO will involve:
-            # - detecting when all of a pack is selected
-            # - avoiding as much as possible pre-selection, so the
-            # more-core routines such as create_pack_from_packs can filter in
-            # a just-in-time fashion. (though having a HEADS list on a
-            # repository might make this a lot easier, because we could
-            # sensibly detect 'new revisions' without doing a full index scan.
-        elif _mod_revision.is_null(revision_id):
-            # nothing to do:
-            return (0, [])
-        else:
-            revision_ids = self.search_missing_revision_ids(revision_id,
-                find_ghosts=find_ghosts).get_keys()
-            if len(revision_ids) == 0:
-                return (0, [])
-        return self._pack(self.source, self.target, revision_ids)
-
-    def _pack(self, source, target, revision_ids):
-        from bzrlib.repofmt.pack_repo import Packer
-        packs = source._pack_collection.all_packs()
-        pack = Packer(self.target._pack_collection, packs, '.fetch',
-            revision_ids).pack()
-        if pack is not None:
-            self.target._pack_collection._save_pack_names()
-            copied_revs = pack.get_revision_count()
-            # Trigger an autopack. This may duplicate effort as we've just done
-            # a pack creation, but for now it is simpler to think about as
-            # 'upload data, then repack if needed'.
-            self.target._pack_collection.autopack()
-            return (copied_revs, [])
-        else:
-            return (0, [])
-
-    @needs_read_lock
-    def search_missing_revision_ids(self, revision_id=None, find_ghosts=True):
-        """See InterRepository.missing_revision_ids().
-
-        :param find_ghosts: Find ghosts throughout the ancestry of
-            revision_id.
-        """
-        if not find_ghosts and revision_id is not None:
-            return self._walk_to_common_revisions([revision_id])
-        elif revision_id is not None:
-            # Find ghosts: search for revisions pointing from one repository to
-            # the other, and vice versa, anywhere in the history of revision_id.
-            graph = self.target.get_graph(other_repository=self.source)
-            searcher = graph._make_breadth_first_searcher([revision_id])
-            found_ids = set()
-            while True:
-                try:
-                    next_revs, ghosts = searcher.next_with_ghosts()
-                except StopIteration:
-                    break
-                if revision_id in ghosts:
-                    raise errors.NoSuchRevision(self.source, revision_id)
-                found_ids.update(next_revs)
-                found_ids.update(ghosts)
-            found_ids = frozenset(found_ids)
-            # Double query here: should be able to avoid this by changing the
-            # graph api further.
-            result_set = found_ids - frozenset(
-                self.target.get_parent_map(found_ids))
-        else:
-            source_ids = self.source.all_revision_ids()
-            # source_ids is the worst possible case we may need to pull.
-            # now we want to filter source_ids against what we actually
-            # have in target, but don't try to check for existence where we know
-            # we do not have a revision as that would be pointless.
-            target_ids = set(self.target.all_revision_ids())
-            result_set = set(source_ids).difference(target_ids)
-        return self.source.revision_ids_to_search_result(result_set)
-
-
 class InterDifferingSerializer(InterRepository):
 
     @classmethod
@@ -3836,7 +3721,6 @@
 InterRepository.register_optimiser(InterSameDataRepository)
 InterRepository.register_optimiser(InterWeaveRepo)
 InterRepository.register_optimiser(InterKnitRepo)
-InterRepository.register_optimiser(InterPackRepo)
 
 
 class CopyConverter(object):
=== modified file 'bzrlib/tests/test_pack_repository.py'
--- bzrlib/tests/test_pack_repository.py 2009-06-10 03:56:49 +0000
+++ bzrlib/tests/test_pack_repository.py 2009-06-16 02:36:36 +0000
@@ -38,6 +38,10 @@
38 upgrade,38 upgrade,
39 workingtree,39 workingtree,
40 )40 )
41from bzrlib.repofmt import (
42 pack_repo,
43 groupcompress_repo,
44 )
41from bzrlib.repofmt.groupcompress_repo import RepositoryFormatCHK145from bzrlib.repofmt.groupcompress_repo import RepositoryFormatCHK1
42from bzrlib.smart import (46from bzrlib.smart import (
43 client,47 client,
@@ -556,58 +560,43 @@
556 missing_ghost.get_inventory, 'ghost')560 missing_ghost.get_inventory, 'ghost')
557561
558 def make_write_ready_repo(self):562 def make_write_ready_repo(self):
559 repo = self.make_repository('.', format=self.get_format())563 format = self.get_format()
564 if isinstance(format.repository_format, RepositoryFormatCHK1):
565 raise TestNotApplicable("No missing compression parents")
566 repo = self.make_repository('.', format=format)
560 repo.lock_write()567 repo.lock_write()
568 self.addCleanup(repo.unlock)
561 repo.start_write_group()569 repo.start_write_group()
570 self.addCleanup(repo.abort_write_group)
562 return repo571 return repo
563572
564 def test_missing_inventories_compression_parent_prevents_commit(self):573 def test_missing_inventories_compression_parent_prevents_commit(self):
565 repo = self.make_write_ready_repo()574 repo = self.make_write_ready_repo()
566 key = ('junk',)575 key = ('junk',)
567 if not getattr(repo.inventories._index, '_missing_compression_parents',
568 None):
-            raise TestSkipped("No missing compression parents")
         repo.inventories._index._missing_compression_parents.add(key)
         self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
         self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
-        repo.abort_write_group()
-        repo.unlock()
 
     def test_missing_revisions_compression_parent_prevents_commit(self):
         repo = self.make_write_ready_repo()
         key = ('junk',)
-        if not getattr(repo.inventories._index, '_missing_compression_parents',
-                None):
-            raise TestSkipped("No missing compression parents")
         repo.revisions._index._missing_compression_parents.add(key)
         self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
         self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
-        repo.abort_write_group()
-        repo.unlock()
 
     def test_missing_signatures_compression_parent_prevents_commit(self):
         repo = self.make_write_ready_repo()
         key = ('junk',)
-        if not getattr(repo.inventories._index, '_missing_compression_parents',
-                None):
-            raise TestSkipped("No missing compression parents")
         repo.signatures._index._missing_compression_parents.add(key)
         self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
         self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
-        repo.abort_write_group()
-        repo.unlock()
 
     def test_missing_text_compression_parent_prevents_commit(self):
         repo = self.make_write_ready_repo()
         key = ('some', 'junk')
-        if not getattr(repo.inventories._index, '_missing_compression_parents',
-                None):
-            raise TestSkipped("No missing compression parents")
         repo.texts._index._missing_compression_parents.add(key)
         self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
         e = self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
-        repo.abort_write_group()
-        repo.unlock()
 
     def test_supports_external_lookups(self):
         repo = self.make_repository('.', format=self.get_format())
 
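The tests in the hunk above all exercise the same invariant: a pack index records keys whose compression parents are missing in a `_missing_compression_parents` set, and `commit_write_group` must refuse to complete (raising `BzrCheckError`, repeatedly if retried) until the group is aborted. A minimal sketch of that invariant, using stand-in classes rather than the real bzrlib objects:

```python
# Stand-in sketch (not bzrlib itself) of the missing-compression-parents
# guard that the tests above rely on.

class BzrCheckError(Exception):
    pass


class FakeIndex(object):
    def __init__(self):
        # Keys whose delta (compression) parents were never inserted.
        self._missing_compression_parents = set()


class FakeRepository(object):
    def __init__(self):
        self.inventories_index = FakeIndex()
        self._in_write_group = False

    def start_write_group(self):
        self._in_write_group = True

    def commit_write_group(self):
        missing = self.inventories_index._missing_compression_parents
        if missing:
            # Committing now would leave deltas with no basis text.
            raise BzrCheckError(
                'missing compression parents: %r' % (sorted(missing),))
        self._in_write_group = False

    def abort_write_group(self):
        # Aborting discards the pending data, so the bookkeeping resets.
        self.inventories_index._missing_compression_parents.clear()
        self._in_write_group = False
```

Calling `commit_write_group` twice, as the tests do, verifies the error is not accidentally cleared by the first failed attempt.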
=== modified file 'bzrlib/tests/test_repository.py'
--- bzrlib/tests/test_repository.py 2009-06-10 03:56:49 +0000
+++ bzrlib/tests/test_repository.py 2009-06-16 02:36:36 +0000
@@ -31,7 +31,10 @@
     UnknownFormatError,
     UnsupportedFormatError,
     )
-from bzrlib import graph
+from bzrlib import (
+    graph,
+    tests,
+    )
 from bzrlib.branchbuilder import BranchBuilder
 from bzrlib.btree_index import BTreeBuilder, BTreeGraphIndex
 from bzrlib.index import GraphIndex, InMemoryGraphIndex
@@ -685,6 +688,147 @@
         self.assertEqual(65536,
             inv.parent_id_basename_to_file_id._root_node.maximum_size)
 
+    def test_stream_source_to_gc(self):
+        source = self.make_repository('source', format='development6-rich-root')
+        target = self.make_repository('target', format='development6-rich-root')
+        stream = source._get_source(target._format)
+        self.assertIsInstance(stream, groupcompress_repo.GroupCHKStreamSource)
+
+    def test_stream_source_to_non_gc(self):
+        source = self.make_repository('source', format='development6-rich-root')
+        target = self.make_repository('target', format='rich-root-pack')
+        stream = source._get_source(target._format)
+        # We don't want the child GroupCHKStreamSource
+        self.assertIs(type(stream), repository.StreamSource)
+
+    def test_get_stream_for_missing_keys_includes_all_chk_refs(self):
+        source_builder = self.make_branch_builder('source',
+                            format='development6-rich-root')
+        # We have to build a fairly large tree, so that we are sure the chk
+        # pages will have split into multiple pages.
+        entries = [('add', ('', 'a-root-id', 'directory', None))]
+        for i in 'abcdefghijklmnopqrstuvwxyz123456789':
+            for j in 'abcdefghijklmnopqrstuvwxyz123456789':
+                fname = i + j
+                fid = fname + '-id'
+                content = 'content for %s\n' % (fname,)
+                entries.append(('add', (fname, fid, 'file', content)))
+        source_builder.start_series()
+        source_builder.build_snapshot('rev-1', None, entries)
+        # Now change a few of them, so we get a few new pages for the second
+        # revision
+        source_builder.build_snapshot('rev-2', ['rev-1'], [
+            ('modify', ('aa-id', 'new content for aa-id\n')),
+            ('modify', ('cc-id', 'new content for cc-id\n')),
+            ('modify', ('zz-id', 'new content for zz-id\n')),
+            ])
+        source_builder.finish_series()
+        source_branch = source_builder.get_branch()
+        source_branch.lock_read()
+        self.addCleanup(source_branch.unlock)
+        target = self.make_repository('target', format='development6-rich-root')
+        source = source_branch.repository._get_source(target._format)
+        self.assertIsInstance(source, groupcompress_repo.GroupCHKStreamSource)
+
+        # On a regular pass, getting the inventories and chk pages for rev-2
+        # would only get the newly created chk pages
+        search = graph.SearchResult(set(['rev-2']), set(['rev-1']), 1,
+                                    set(['rev-2']))
+        simple_chk_records = []
+        for vf_name, substream in source.get_stream(search):
+            if vf_name == 'chk_bytes':
+                for record in substream:
+                    simple_chk_records.append(record.key)
+            else:
+                for _ in substream:
+                    continue
+        # 3 pages, the root (InternalNode), + 2 pages which actually changed
+        self.assertEqual([('sha1:91481f539e802c76542ea5e4c83ad416bf219f73',),
+                          ('sha1:4ff91971043668583985aec83f4f0ab10a907d3f',),
+                          ('sha1:81e7324507c5ca132eedaf2d8414ee4bb2226187',),
+                          ('sha1:b101b7da280596c71a4540e9a1eeba8045985ee0',)],
+                         simple_chk_records)
+        # Now, when we do a similar call using 'get_stream_for_missing_keys'
+        # we should get a much larger set of pages.
+        missing = [('inventories', 'rev-2')]
+        full_chk_records = []
+        for vf_name, substream in source.get_stream_for_missing_keys(missing):
+            if vf_name == 'inventories':
+                for record in substream:
+                    self.assertEqual(('rev-2',), record.key)
+            elif vf_name == 'chk_bytes':
+                for record in substream:
+                    full_chk_records.append(record.key)
+            else:
+                self.fail('Should not be getting a stream of %s' % (vf_name,))
+        # We have 257 records now. This is because we have 1 root page, and 256
+        # leaf pages in a complete listing.
+        self.assertEqual(257, len(full_chk_records))
+        self.assertSubset(simple_chk_records, full_chk_records)
+
+
+class TestKnitPackStreamSource(tests.TestCaseWithMemoryTransport):
+
+    def test_source_to_exact_pack_092(self):
+        source = self.make_repository('source', format='pack-0.92')
+        target = self.make_repository('target', format='pack-0.92')
+        stream_source = source._get_source(target._format)
+        self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource)
+
+    def test_source_to_exact_pack_rich_root_pack(self):
+        source = self.make_repository('source', format='rich-root-pack')
+        target = self.make_repository('target', format='rich-root-pack')
+        stream_source = source._get_source(target._format)
+        self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource)
+
+    def test_source_to_exact_pack_19(self):
+        source = self.make_repository('source', format='1.9')
+        target = self.make_repository('target', format='1.9')
+        stream_source = source._get_source(target._format)
+        self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource)
+
+    def test_source_to_exact_pack_19_rich_root(self):
+        source = self.make_repository('source', format='1.9-rich-root')
+        target = self.make_repository('target', format='1.9-rich-root')
+        stream_source = source._get_source(target._format)
+        self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource)
+
+    def test_source_to_remote_exact_pack_19(self):
+        trans = self.make_smart_server('target')
+        trans.ensure_base()
+        source = self.make_repository('source', format='1.9')
+        target = self.make_repository('target', format='1.9')
+        target = repository.Repository.open(trans.base)
+        stream_source = source._get_source(target._format)
+        self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource)
+
+    def test_stream_source_to_non_exact(self):
+        source = self.make_repository('source', format='pack-0.92')
+        target = self.make_repository('target', format='1.9')
+        stream = source._get_source(target._format)
+        self.assertIs(type(stream), repository.StreamSource)
+
+    def test_stream_source_to_non_exact_rich_root(self):
+        source = self.make_repository('source', format='1.9')
+        target = self.make_repository('target', format='1.9-rich-root')
+        stream = source._get_source(target._format)
+        self.assertIs(type(stream), repository.StreamSource)
+
+    def test_source_to_remote_non_exact_pack_19(self):
+        trans = self.make_smart_server('target')
+        trans.ensure_base()
+        source = self.make_repository('source', format='1.9')
+        target = self.make_repository('target', format='1.6')
+        target = repository.Repository.open(trans.base)
+        stream_source = source._get_source(target._format)
+        self.assertIs(type(stream_source), repository.StreamSource)
+
+    def test_stream_source_to_knit(self):
+        source = self.make_repository('source', format='pack-0.92')
+        target = self.make_repository('target', format='dirstate')
+        stream = source._get_source(target._format)
+        self.assertIs(type(stream), repository.StreamSource)
+
 
 class TestDevelopment6FindParentIdsOfRevisions(TestCaseWithTransport):
     """Tests for _find_parent_ids_of_revisions."""
@@ -1204,84 +1348,3 @@
         self.assertTrue(new_pack.inventory_index._optimize_for_size)
         self.assertTrue(new_pack.text_index._optimize_for_size)
         self.assertTrue(new_pack.signature_index._optimize_for_size)
-
-
-class TestGCCHKPackCollection(TestCaseWithTransport):
-
-    def test_stream_source_to_gc(self):
-        source = self.make_repository('source', format='development6-rich-root')
-        target = self.make_repository('target', format='development6-rich-root')
-        stream = source._get_source(target._format)
-        self.assertIsInstance(stream, groupcompress_repo.GroupCHKStreamSource)
-
-    def test_stream_source_to_non_gc(self):
-        source = self.make_repository('source', format='development6-rich-root')
-        target = self.make_repository('target', format='rich-root-pack')
-        stream = source._get_source(target._format)
-        # We don't want the child GroupCHKStreamSource
-        self.assertIs(type(stream), repository.StreamSource)
-
-    def test_get_stream_for_missing_keys_includes_all_chk_refs(self):
-        source_builder = self.make_branch_builder('source',
-                            format='development6-rich-root')
-        # We have to build a fairly large tree, so that we are sure the chk
-        # pages will have split into multiple pages.
-        entries = [('add', ('', 'a-root-id', 'directory', None))]
-        for i in 'abcdefghijklmnopqrstuvwxyz123456789':
-            for j in 'abcdefghijklmnopqrstuvwxyz123456789':
-                fname = i + j
-                fid = fname + '-id'
-                content = 'content for %s\n' % (fname,)
-                entries.append(('add', (fname, fid, 'file', content)))
-        source_builder.start_series()
-        source_builder.build_snapshot('rev-1', None, entries)
-        # Now change a few of them, so we get a few new pages for the second
-        # revision
-        source_builder.build_snapshot('rev-2', ['rev-1'], [
-            ('modify', ('aa-id', 'new content for aa-id\n')),
-            ('modify', ('cc-id', 'new content for cc-id\n')),
-            ('modify', ('zz-id', 'new content for zz-id\n')),
-            ])
-        source_builder.finish_series()
-        source_branch = source_builder.get_branch()
-        source_branch.lock_read()
-        self.addCleanup(source_branch.unlock)
-        target = self.make_repository('target', format='development6-rich-root')
-        source = source_branch.repository._get_source(target._format)
-        self.assertIsInstance(source, groupcompress_repo.GroupCHKStreamSource)
-
-        # On a regular pass, getting the inventories and chk pages for rev-2
-        # would only get the newly created chk pages
-        search = graph.SearchResult(set(['rev-2']), set(['rev-1']), 1,
-                                    set(['rev-2']))
-        simple_chk_records = []
-        for vf_name, substream in source.get_stream(search):
-            if vf_name == 'chk_bytes':
-                for record in substream:
-                    simple_chk_records.append(record.key)
-            else:
-                for _ in substream:
-                    continue
-        # 3 pages, the root (InternalNode), + 2 pages which actually changed
-        self.assertEqual([('sha1:91481f539e802c76542ea5e4c83ad416bf219f73',),
-                          ('sha1:4ff91971043668583985aec83f4f0ab10a907d3f',),
-                          ('sha1:81e7324507c5ca132eedaf2d8414ee4bb2226187',),
-                          ('sha1:b101b7da280596c71a4540e9a1eeba8045985ee0',)],
-                         simple_chk_records)
-        # Now, when we do a similar call using 'get_stream_for_missing_keys'
-        # we should get a much larger set of pages.
-        missing = [('inventories', 'rev-2')]
-        full_chk_records = []
-        for vf_name, substream in source.get_stream_for_missing_keys(missing):
-            if vf_name == 'inventories':
-                for record in substream:
-                    self.assertEqual(('rev-2',), record.key)
-            elif vf_name == 'chk_bytes':
-                for record in substream:
-                    full_chk_records.append(record.key)
-            else:
-                self.fail('Should not be getting a stream of %s' % (vf_name,))
-        # We have 257 records now. This is because we have 1 root page, and 256
-        # leaf pages in a complete listing.
-        self.assertEqual(257, len(full_chk_records))
-        self.assertSubset(simple_chk_records, full_chk_records)
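The 257-record assertion in the relocated CHK test follows from simple arithmetic: the builder commits one two-character file for every pair drawn from a 35-character alphabet, enough to force the CHK map to split into multiple pages, and a full listing then streams the single root InternalNode plus all of its leaf pages. A back-of-envelope check, assuming the 256 leaf pages the test's own comment states:

```python
# Sanity arithmetic for the chk-record counts asserted in the test above.

alphabet = 'abcdefghijklmnopqrstuvwxyz123456789'  # 35 characters

# rev-1 adds one file per two-character combination of the alphabet.
num_files = len(alphabet) ** 2

# Full listing for get_stream_for_missing_keys: the root InternalNode
# plus every leaf page (256 per the test's comment).
root_pages = 1
leaf_pages = 256
full_listing_records = root_pages + leaf_pages

# Incremental fetch of rev-2: the root plus only the pages that the
# three modified files caused to be rewritten.
incremental_records = 4
```

The gap between 4 records on an incremental fetch and 257 on a full listing is what makes `get_stream_for_missing_keys` the right tool for repairing a repository with missing inventories.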