Merge lp:~lifeless/bzr/autopack-cross-format-fetch-1 into lp:~bzr/bzr/trunk-old

Proposed by Robert Collins
Status: Merged
Merged at revision: not available
Proposed branch: lp:~lifeless/bzr/autopack-cross-format-fetch-1
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: None lines
To merge this branch: bzr merge lp:~lifeless/bzr/autopack-cross-format-fetch-1
Reviewer: John A Meinel
Review status: Approve
Review via email: mp+7748@code.launchpad.net
Robert Collins (lifeless) wrote:

This branch causes a partial pack to happen when fetching across
serialisers for both the IDS and streaming code paths.

On the way, I encountered bug 365615 and had to fix it. I'd like John in
particular to review the gc repo changes accordingly: while textually
small, their implications are not, and he wrote the GCCHK packer.

-Rob
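
The flow being added is, roughly, the following; a schematic sketch
distilled from the repository.py hunks in the preview diff below, not the
literal code (the batch loop is illustrative only):

    hints = []
    for batch in batches:  # IDS copies revisions across in batches
        target.start_write_group()
        # ... convert and insert one batch of revisions ...
        hint = target.commit_write_group()  # now returns new pack names, or None
        if hint:
            hints.extend(hint)
    if hints and target._format.pack_compresses:
        target.pack(hint=hints)  # partial pack covering only the new packs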

John A Meinel (jameinel) wrote:


Robert Collins wrote:
> Robert Collins has proposed merging lp:~lifeless/bzr/autopack-cross-format-fetch-1 into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This branch causes a partial pack to happen when fetching across
> serialisers for both the IDS and streaming code paths.
>
> On the way, I encountered bug 365615 and had to fix it. I'd like John in
> particular to review the gc repo changes accordingly: while textually
> small, their implications are not, and he wrote the GCCHK packer.
>
> -Rob
>
>

...

I assume this is your fix for 365615:

                 for record in stream:
+                    if record.storage_kind == 'absent':
+                        # An absent CHK record: we assume that the missing
+                        # record is in a different pack - e.g. a page not
+                        # altered by the commit we're packing.
+                        continue

Missing pages could be handled in two ways. What you have written
should work, and so would this:

=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-06-18 19:19:49 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-06-22 14:07:01 +0000
@@ -320,6 +320,7 @@
                 cur_keys = []
                 for prefix in sorted(keys_by_search_prefix):
                     cur_keys.extend(keys_by_search_prefix.pop(prefix))
+                cur_keys = remaining_keys.intersection(cur_keys)
         for stream in _get_referenced_stream(self._chk_id_roots,
                                              self._gather_text_refs):
             yield stream

In a lot of ways I'd like to get rid of "remaining_keys", though. (It is
very useful for progress indication, but we should be able to use
Index.key_count() for progress.) It just depends on whether there are any
failure modes that would give us AbsentContentFactory where
'remaining_keys' might say something different.

I'm pretty sure at one point I had something like intersection() but it
probably got factored out, and we didn't have specific tests for it.

Specifically, to trigger this you have to have a "big" inventory:
something bigger than, say, 200 items, so that it triggers a split
(otherwise you only have a root node, and no references). Then you create
one pack file with most of the chk pages, create another group of pack
files containing only the deltas, and autopack only those new ones.

It happens all the time in the real world, and never in our test suite...

I also didn't see you add a test that would trigger this, though
certainly I may just have missed it.

Anyway the fix seems fine as is, but a test case would be great to
prevent it from regressing.
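
A rough sketch of what such a test might look like, following the recipe
above; the format name, tree size, and pack bookkeeping here are
illustrative assumptions rather than code from this branch:

    def test_pack_hint_with_absent_chk_pages(self):
        # ~300 files forces the chk inventory to split into multiple
        # pages; the follow-up commit only adds delta pages, so packing
        # just the new packs streams 'absent' records for shared pages.
        tree = self.make_branch_and_tree('tree', format='2a')
        names = ['file%03d' % i for i in range(300)]
        self.build_tree(['tree/' + name for name in names])
        tree.add(names)
        tree.commit('big base commit')
        to_keep = tree.branch.repository._pack_collection.names()
        tree.commit('delta-only commit')
        all_names = tree.branch.repository._pack_collection.names()
        combine = list(set(all_names) - set(to_keep))
        # Without the 'absent' handling the partial pack raised an error.
        tree.branch.repository.pack(hint=combine)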

As for the rest of the fix... I noticed you track "hints" as a list, but
then in the inner pack loop you do:
+            if not hint or pack.name in hint:
+                pack_operations[-1][0] += pack.get_revision_count()
+                pack_operations[-1][1].append(pack)

^- This has moderate implications for IDS when copying an entire
repository. As you can have 100s (1000s?) of pack files c...

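The concern being raised: hint is a plain list, so "pack.name in hint" is
a linear scan for every pack, which goes quadratic when an IDS copy of a
large repository accumulates hundreds of new pack names. A sketch of the
obvious remedy, which Robert agrees to below as "setify hint":

    # Hypothetical tweak inside RepositoryPackCollection.pack(): set
    # membership is O(1) per lookup instead of O(len(hint)).
    if hint is not None:
        hint = set(hint)
    pack_operations = [[0, []]]
    for pack in self.all_packs():
        if not hint or pack.name in hint:
            pack_operations[-1][0] += pack.get_revision_count()
            pack_operations[-1][1].append(pack)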

review: Approve
Robert Collins (lifeless) wrote:

On Mon, 2009-06-22 at 14:21 +0000, John A Meinel wrote:

> I assume this is your fix for 365615:
>
>                  for record in stream:
> +                    if record.storage_kind == 'absent':
> +                        # An absent CHK record: we assume that the missing
> +                        # record is in a different pack - e.g. a page not
> +                        # altered by the commit we're packing.
> +                        continue

Yes.

> Missing pages could be handled in two ways. What you have written
> should work, and so would this:
>
> === modified file 'bzrlib/repofmt/groupcompress_repo.py'
> --- bzrlib/repofmt/groupcompress_repo.py 2009-06-18 19:19:49 +0000
> +++ bzrlib/repofmt/groupcompress_repo.py 2009-06-22 14:07:01 +0000
> @@ -320,6 +320,7 @@
>                  cur_keys = []
>                  for prefix in sorted(keys_by_search_prefix):
>                      cur_keys.extend(keys_by_search_prefix.pop(prefix))
> +                cur_keys = remaining_keys.intersection(cur_keys)
>          for stream in _get_referenced_stream(self._chk_id_roots,
>                                               self._gather_text_refs):
>              yield stream
>

I prefer the above, and will discard the change I made if the above
works - I'll give it a shot asap.

> I'm pretty sure at one point I had something like intersection() but it
> probably got factored out, and we didn't have specific tests for it.
>
> Specifically, to trigger this you have to have a "big" inventory:
> something bigger than, say, 200 items, so that it triggers a split
> (otherwise you only have a root node, and no references). Then you create
> one pack file with most of the chk pages, create another group of pack
> files containing only the deltas, and autopack only those new ones.

Actually, you can test it by doing a simple loop on tree.commit('m'),
because all the nodes will be common. Twenty no-change commits in a row will
trigger it (the first autopack will include the chk pages, but the
second won't). My test for partial explicit packs was what triggered it
for me.
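
As a sketch, that trigger is tiny; the '2a' format name is assumed here,
and the exact autopack threshold is an implementation detail:

    def test_autopack_after_no_change_commits(self):
        # Every commit reuses the existing chk pages, so by the second
        # autopack the referenced pages live in an already-packed pack
        # and arrive as 'absent' records.
        tree = self.make_branch_and_tree('tree', format='2a')
        for i in range(20):
            tree.commit('m')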

> It happens all the time in the real world, and never in our test suite...
>
> I also didn't see you add a test that would trigger this, though
> certainly I may just have missed it.

Not as a dedicated thing; I'll add one for clarity. I'll also setify
hint.

-Rob

Robert Collins (lifeless) wrote:

On Mon, 2009-06-22 at 14:21 +0000, John A Meinel wrote:
>
>                      cur_keys.extend(keys_by_search_prefix.pop(prefix))
> +                cur_keys = remaining_keys.intersection(cur_keys)
>          for stream in _get_referenced_stream(self._chk_id_roots,
>                                               self._gather_text_refs):

Just to note: this doesn't work.

I think it's because the root pages can be inaccessible too, and the
filtering happens too late.

For now I've gone with continue on 'absent' records.

-Rob

John A Meinel (jameinel) wrote:


Robert Collins wrote:
> On Mon, 2009-06-22 at 14:21 +0000, John A Meinel wrote:
>>                      cur_keys.extend(keys_by_search_prefix.pop(prefix))
>> +                cur_keys = remaining_keys.intersection(cur_keys)
>>          for stream in _get_referenced_stream(self._chk_id_roots,
>>                                               self._gather_text_refs):
>
> Just to note: this doesn't work.
>
> I think it's because the root pages can be inaccessible too, and the
> filtering happens too late.
>
> For now I've gone with continue on 'absent' records.
>
> -Rob
>

Well, it would work for cases other than the 20-no-op case :), since that
is the only time the root page is unchanged.

Certainly we can live with this for now, but I think you could actually
filter earlier with a similar .intersection check. Namely:

=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-06-22 15:13:45 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-06-23 03:14:58 +0000
@@ -320,6 +320,10 @@
                 cur_keys = []
                 for prefix in sorted(keys_by_search_prefix):
                     cur_keys.extend(keys_by_search_prefix.pop(prefix))
+                cur_keys = [key for key in cur_keys if key in remaining_keys]
+                remaining_keys.intersection(cur_keys)
+                self._chk_id_roots = [key for key in self._chk_id_roots
+                                      if key in remaining_keys]
         for stream in _get_referenced_stream(self._chk_id_roots,
                                              self._gather_text_refs):
             yield stream

Looking at the code, I saw that I actually did this for 'pid_roots'
since they are likely to not have been changed.

The reason you don't actually want ".intersection" is that it returns a
set, which breaks the ordering that we had carefully crafted to get
optimal compression (namely, grouping by the search key prefix).
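
A quick illustration of the ordering hazard, independent of bzrlib (the
printed set order here is arbitrary):

    >>> cur_keys = [('c',), ('a',), ('b',)]           # carefully ordered
    >>> remaining_keys = set(cur_keys)
    >>> remaining_keys.intersection(cur_keys)         # ordering lost
    set([('a',), ('c',), ('b',)])
    >>> [k for k in cur_keys if k in remaining_keys]  # ordering kept
    [('c',), ('a',), ('b',)]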

John
=:->


Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-06-18 19:19:49 +0000
+++ NEWS 2009-06-22 06:14:38 +0000
@@ -42,6 +42,10 @@
   ``BZR_PROGRESS_BAR`` is set to ``none``.
   (Martin Pool, #339385)
 
+* Repositories using CHK pages (which includes the new 2a format) will no
+  longer error during commit or push operations when an autopack operation
+  is triggered. (Robert Collins, #365615)
+
 Internals
 *********
 
@@ -58,11 +62,32 @@
   for files with long ancestry and 'cherrypicked' changes.)
   (John Arbash Meinel)
 
+* ``GroupCompress`` repositories now take advantage of the pack hints
+  parameter to permit cross-format fetching to incrementally pack the
+  converted data. (Robert Collins)
+
 * pack <=> pack fetching is now done via a ``PackStreamSource`` rather
   than the ``Packer`` code. The user visible change is that we now
   properly fetch the minimum number of texts for non-smart fetching.
   (John Arbash Meinel)
 
+* ``Repository.commit_write_group`` now returns opaque data about what
+  was committed, for passing to the ``Repository.pack``. Repositories
+  without atomic commits will still return None. (Robert Collins)
+
+* ``Repository.pack`` now takes an optional ``hint`` parameter
+  which will support doing partial packs for repositories that can do
+  that. (Robert Collins)
+
+* RepositoryFormat has a new attribute 'pack_compresses' which is True
+  when doing a pack operation changes the compression of content in the
+  repository. (Robert Collins)
+
+* ``StreamSink`` and ``InterDifferingSerialiser`` will call
+  ``Repository.pack`` with the hint returned by
+  ``Repository.commit_write_group`` if the formats were different and the
+  repository can increase compression by doing a pack operation.
+  (Robert Collins, #376748)
 
 Improvements
 ************
 
=== modified file 'bzrlib/remote.py'
--- bzrlib/remote.py 2009-06-17 03:53:51 +0000
+++ bzrlib/remote.py 2009-06-21 23:51:17 +0000
@@ -566,6 +566,11 @@
         return self._creating_repo._real_repository._format.network_name()
 
     @property
+    def pack_compresses(self):
+        self._ensure_real()
+        return self._custom_format.pack_compresses
+
+    @property
     def _serializer(self):
         self._ensure_real()
         return self._custom_format._serializer
@@ -1491,13 +1496,13 @@
         return self._real_repository.inventories
 
     @needs_write_lock
-    def pack(self):
+    def pack(self, hint=None):
         """Compress the data within the repository.
 
         This is not currently implemented within the smart server.
         """
         self._ensure_real()
-        return self._real_repository.pack()
+        return self._real_repository.pack(hint=hint)
 
     @property
     def revisions(self):
 
=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-06-18 19:19:49 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-06-22 04:56:21 +0000
@@ -218,6 +218,7 @@
         p_id_roots_set = set()
         stream = source_vf.get_record_stream(keys, 'groupcompress', True)
         for idx, record in enumerate(stream):
+            # Inventories should always be with revisions; assume success.
             bytes = record.get_bytes_as('fulltext')
             chk_inv = inventory.CHKInventory.deserialise(None, bytes,
                                                          record.key)
@@ -294,6 +295,11 @@
                 stream = source_vf.get_record_stream(cur_keys,
                                                      'as-requested', True)
                 for record in stream:
+                    if record.storage_kind == 'absent':
+                        # An absent CHK record: we assume that the missing
+                        # record is in a different pack - e.g. a page not
+                        # altered by the commit we're packing.
+                        continue
                     bytes = record.get_bytes_as('fulltext')
                     # We don't care about search_key_func for this code,
                     # because we only care about external references.
@@ -558,11 +564,6 @@
     pack_factory = GCPack
     resumed_pack_factory = ResumedGCPack
 
-    def _already_packed(self):
-        """Is the collection already packed?"""
-        # Always repack GC repositories for now
-        return False
-
     def _execute_pack_operations(self, pack_operations,
                                  _packer_class=GCCHKPacker,
                                  reload_func=None):
@@ -1048,6 +1049,7 @@
     _fetch_order = 'unordered'
     _fetch_uses_deltas = False # essentially ignored by the groupcompress code.
     fast_deltas = True
+    pack_compresses = True
 
     def _get_matching_bzrdir(self):
         return bzrdir.format_registry.make_bzrdir('development6-rich-root')
 
=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py 2009-06-17 17:57:15 +0000
+++ bzrlib/repofmt/pack_repo.py 2009-06-22 04:56:21 +0000
@@ -1459,12 +1459,12 @@
         in synchronisation with certain steps. Otherwise the names collection
         is not flushed.
 
-        :return: True if packing took place.
+        :return: Something evaluating true if packing took place.
         """
         while True:
             try:
                 return self._do_autopack()
-            except errors.RetryAutopack, e:
+            except errors.RetryAutopack:
                 # If we get a RetryAutopack exception, we should abort the
                 # current action, and retry.
                 pass
@@ -1474,7 +1474,7 @@
         total_revisions = self.revision_index.combined_index.key_count()
         total_packs = len(self._names)
         if self._max_pack_count(total_revisions) >= total_packs:
-            return False
+            return None
         # determine which packs need changing
         pack_distribution = self.pack_distribution(total_revisions)
         existing_packs = []
@@ -1502,10 +1502,10 @@
             'containing %d revisions. Packing %d files into %d affecting %d'
             ' revisions', self, total_packs, total_revisions, num_old_packs,
             num_new_packs, num_revs_affected)
-        self._execute_pack_operations(pack_operations,
+        result = self._execute_pack_operations(pack_operations,
                                       reload_func=self._restart_autopack)
         mutter('Auto-packing repository %s completed', self)
-        return True
+        return result
 
     def _execute_pack_operations(self, pack_operations, _packer_class=Packer,
                                  reload_func=None):
@@ -1513,7 +1513,7 @@
 
         :param pack_operations: A list of [revision_count, packs_to_combine].
         :param _packer_class: The class of packer to use (default: Packer).
-        :return: None.
+        :return: The new pack names.
         """
         for revision_count, packs in pack_operations:
             # we may have no-ops from the setup logic
@@ -1535,10 +1535,11 @@
             self._remove_pack_from_memory(pack)
         # record the newly available packs and stop advertising the old
         # packs
-        self._save_pack_names(clear_obsolete_packs=True)
+        result = self._save_pack_names(clear_obsolete_packs=True)
         # Move the old packs out of the way now they are no longer referenced.
         for revision_count, packs in pack_operations:
             self._obsolete_packs(packs)
+        return result
 
     def _flush_new_pack(self):
         if self._new_pack is not None:
@@ -1554,29 +1555,26 @@
 
     def _already_packed(self):
         """Is the collection already packed?"""
-        return len(self._names) < 2
+        return not (self.repo._format.pack_compresses or (len(self._names) > 1))
 
-    def pack(self):
+    def pack(self, hint=None):
         """Pack the pack collection totally."""
         self.ensure_loaded()
         total_packs = len(self._names)
         if self._already_packed():
-            # This is arguably wrong because we might not be optimal, but for
-            # now lets leave it in. (e.g. reconcile -> one pack. But not
-            # optimal.
             return
         total_revisions = self.revision_index.combined_index.key_count()
         # XXX: the following may want to be a class, to pack with a given
        # policy.
         mutter('Packing repository %s, which has %d pack files, '
-            'containing %d revisions into 1 packs.', self, total_packs,
-            total_revisions)
+            'containing %d revisions with hint %r.', self, total_packs,
+            total_revisions, hint)
         # determine which packs need changing
-        pack_distribution = [1]
         pack_operations = [[0, []]]
         for pack in self.all_packs():
-            pack_operations[-1][0] += pack.get_revision_count()
-            pack_operations[-1][1].append(pack)
+            if not hint or pack.name in hint:
+                pack_operations[-1][0] += pack.get_revision_count()
+                pack_operations[-1][1].append(pack)
         self._execute_pack_operations(pack_operations, OptimisingPacker)
 
     def plan_autopack_combinations(self, existing_packs, pack_distribution):
@@ -1938,6 +1936,7 @@
 
         :param clear_obsolete_packs: If True, clear out the contents of the
             obsolete_packs directory.
+        :return: A list of the names saved that were not previously on disk.
         """
         self.lock_names()
         try:
@@ -1958,6 +1957,7 @@
             self._unlock_names()
         # synchronise the memory packs list with what we just wrote:
         self._syncronize_pack_names_from_disk_nodes(disk_nodes)
+        return [new_node[0][0] for new_node in new_nodes]
 
     def reload_pack_names(self):
         """Sync our pack listing with what is present in the repository.
@@ -2097,7 +2097,7 @@
             if not self.autopack():
                 # when autopack takes no steps, the names list is still
                 # unsaved.
-                self._save_pack_names()
+                return self._save_pack_names()
 
     def _suspend_write_group(self):
         tokens = [pack.name for pack in self._resumed_packs]
@@ -2348,13 +2348,13 @@
         raise NotImplementedError(self.dont_leave_lock_in_place)
 
     @needs_write_lock
-    def pack(self):
+    def pack(self, hint=None):
         """Compress the data within the repository.
 
         This will pack all the data to a single pack. In future it may
         recompress deltas or do other such expensive operations.
         """
-        self._pack_collection.pack()
+        self._pack_collection.pack(hint=hint)
 
     @needs_write_lock
     def reconcile(self, other=None, thorough=False):
 
=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py 2009-06-17 17:57:15 +0000
+++ bzrlib/repository.py 2009-06-22 06:14:38 +0000
@@ -1413,8 +1413,9 @@
             raise errors.BzrError('mismatched lock context %r and '
                 'write group %r.' %
                 (self.get_transaction(), self._write_group))
-        self._commit_write_group()
+        result = self._commit_write_group()
         self._write_group = None
+        return result
 
     def _commit_write_group(self):
         """Template method for per-repository write group cleanup.
@@ -2427,7 +2428,7 @@
         keys = tsort.topo_sort(parent_map)
         return [None] + list(keys)
 
-    def pack(self):
+    def pack(self, hint=None):
         """Compress the data within the repository.
 
         This operation only makes sense for some repository types. For other
@@ -2436,6 +2437,13 @@
         This stub method does not require a lock, but subclasses should use
         @needs_write_lock as this is a long running call its reasonable to
         implicitly lock for the user.
+
+        :param hint: If not supplied, the whole repository is packed.
+            If supplied, the repository may use the hint parameter as a
+            hint for the parts of the repository to pack. A hint can be
+            obtained from the result of commit_write_group(). Out of
+            date hints are simply ignored, because concurrent operations
+            can obsolete them rapidly.
         """
 
     def get_transaction(self):
@@ -2844,6 +2852,11 @@
     # Does this format have < O(tree_size) delta generation. Used to hint what
     # code path for commit, amongst other things.
     fast_deltas = None
+    # Does doing a pack operation compress data? Useful for the pack UI command
+    # (so if there is one pack, the operation can still proceed because it may
+    # help), and for fetching when data won't have come from the same
+    # compressor.
+    pack_compresses = False
 
     def __str__(self):
         return "<%s>" % self.__class__.__name__
@@ -3675,6 +3688,7 @@
         cache = lru_cache.LRUCache(100)
         cache[basis_id] = basis_tree
         del basis_tree # We don't want to hang on to it here
+        hints = []
         for offset in range(0, len(revision_ids), batch_size):
             self.target.start_write_group()
             try:
@@ -3686,7 +3700,11 @@
                 self.target.abort_write_group()
                 raise
             else:
-                self.target.commit_write_group()
+                hint = self.target.commit_write_group()
+                if hint:
+                    hints.extend(hint)
+        if hints and self.target._format.pack_compresses:
+            self.target.pack(hint=hints)
         pb.update('Transferring revisions', len(revision_ids),
                   len(revision_ids))
 
@@ -4034,7 +4052,10 @@
                 # missing keys can handle suspending a write group).
                 write_group_tokens = self.target_repo.suspend_write_group()
                 return write_group_tokens, missing_keys
-        self.target_repo.commit_write_group()
+        hint = self.target_repo.commit_write_group()
+        if (to_serializer != src_serializer and
+            self.target_repo._format.pack_compresses):
+            self.target_repo.pack(hint=hint)
         return [], set()
 
     def _extract_and_insert_inventories(self, substream, serializer):
 
=== modified file 'bzrlib/tests/per_repository/test_pack.py'
--- bzrlib/tests/per_repository/test_pack.py 2009-03-23 14:59:43 +0000
+++ bzrlib/tests/per_repository/test_pack.py 2009-06-21 23:51:17 +0000
@@ -24,3 +24,14 @@
     def test_pack_empty_does_not_error(self):
         repo = self.make_repository('.')
         repo.pack()
+
+    def test_pack_accepts_opaque_hint(self):
+        # For requesting packs of a repository where some data is known to be
+        # unoptimal we permit packing just some data via a hint. If the hint is
+        # illegible it is ignored.
+        tree = self.make_branch_and_tree('tree')
+        rev1 = tree.commit('1')
+        rev2 = tree.commit('2')
+        rev3 = tree.commit('3')
+        rev4 = tree.commit('4')
+        tree.branch.repository.pack(hint=[rev3, rev4])
 
=== modified file 'bzrlib/tests/per_repository/test_repository.py'
--- bzrlib/tests/per_repository/test_repository.py 2009-06-17 21:33:03 +0000
+++ bzrlib/tests/per_repository/test_repository.py 2009-06-19 04:19:22 +0000
@@ -66,29 +66,29 @@
 
 class TestRepository(TestCaseWithRepository):
 
+    def assertFormatAttribute(self, attribute, allowed_values):
+        """Assert that the format has an attribute 'attribute'."""
+        repo = self.make_repository('repo')
+        self.assertSubset([getattr(repo._format, attribute)], allowed_values)
+
     def test_attribute__fetch_order(self):
         """Test the the _fetch_order attribute."""
-        tree = self.make_branch_and_tree('tree')
-        repo = tree.branch.repository
-        self.assertTrue(repo._format._fetch_order in ('topological', 'unordered'))
+        self.assertFormatAttribute('_fetch_order', ('topological', 'unordered'))
 
     def test_attribute__fetch_uses_deltas(self):
         """Test the the _fetch_uses_deltas attribute."""
-        tree = self.make_branch_and_tree('tree')
-        repo = tree.branch.repository
-        self.assertTrue(repo._format._fetch_uses_deltas in (True, False))
+        self.assertFormatAttribute('_fetch_uses_deltas', (True, False))
 
     def test_attribute_fast_deltas(self):
         """Test the format.fast_deltas attribute."""
-        tree = self.make_branch_and_tree('tree')
-        repo = tree.branch.repository
-        self.assertTrue(repo._format.fast_deltas in (True, False))
+        self.assertFormatAttribute('fast_deltas', (True, False))
 
     def test_attribute__fetch_reconcile(self):
         """Test the the _fetch_reconcile attribute."""
-        tree = self.make_branch_and_tree('tree')
-        repo = tree.branch.repository
-        self.assertTrue(repo._format._fetch_reconcile in (True, False))
+        self.assertFormatAttribute('_fetch_reconcile', (True, False))
+
+    def test_attribute_format_pack_compresses(self):
+        self.assertFormatAttribute('pack_compresses', (True, False))
 
     def test_attribute_inventories_store(self):
         """Test the existence of the inventories attribute."""
 
=== modified file 'bzrlib/tests/per_repository/test_write_group.py'
--- bzrlib/tests/per_repository/test_write_group.py 2009-06-10 03:56:49 +0000
+++ bzrlib/tests/per_repository/test_write_group.py 2009-06-22 02:25:09 +0000
@@ -68,11 +68,14 @@
         repo.commit_write_group()
         repo.unlock()
 
-    def test_commit_write_group_gets_None(self):
+    def test_commit_write_group_does_not_error(self):
         repo = self.make_repository('.')
         repo.lock_write()
         repo.start_write_group()
-        self.assertEqual(None, repo.commit_write_group())
+        # commit_write_group can either return None (for repositories without
+        # isolated transactions) or a hint for pack(). So we only check it
+        # works in this interface test, because all repositories are exercised.
+        repo.commit_write_group()
         repo.unlock()
 
     def test_unlock_in_write_group(self):
 
=== modified file 'bzrlib/tests/test_pack_repository.py'
--- bzrlib/tests/test_pack_repository.py 2009-06-17 17:57:15 +0000
+++ bzrlib/tests/test_pack_repository.py 2009-06-22 02:25:09 +0000
@@ -238,6 +238,35 @@
         pack_names = [node[1][0] for node in index.iter_all_entries()]
         self.assertTrue(large_pack_name in pack_names)
 
+    def test_commit_write_group_returns_new_pack_names(self):
+        format = self.get_format()
+        tree = self.make_branch_and_tree('foo', format=format)
+        tree.commit('first post')
+        repo = tree.branch.repository
+        repo.lock_write()
+        try:
+            repo.start_write_group()
+            try:
+                inv = inventory.Inventory(revision_id="A")
+                inv.root.revision = "A"
+                repo.texts.add_lines((inv.root.file_id, "A"), [], [])
+                rev = _mod_revision.Revision(timestamp=0, timezone=None,
+                    committer="Foo Bar <foo@example.com>", message="Message",
+                    revision_id="A")
+                rev.parent_ids = ()
+                repo.add_revision("A", rev, inv=inv)
+            except:
+                repo.abort_write_group()
+                raise
+            else:
+                old_names = repo._pack_collection._names.keys()
+                result = repo.commit_write_group()
+                cur_names = repo._pack_collection._names.keys()
+                new_names = list(set(cur_names) - set(old_names))
+                self.assertEqual(new_names, result)
+        finally:
+            repo.unlock()
+
     def test_fail_obsolete_deletion(self):
         # failing to delete obsolete packs is not fatal
         format = self.get_format()
 
=== modified file 'bzrlib/tests/test_repository.py'
--- bzrlib/tests/test_repository.py 2009-06-18 18:00:01 +0000
+++ bzrlib/tests/test_repository.py 2009-06-22 06:15:41 +0000
@@ -673,10 +673,14 @@
         self.assertFalse(repo._format.supports_external_lookups)
 
 
-class TestDevelopment6(TestCaseWithTransport):
+class Test2a(TestCaseWithTransport):
+
+    def test_format_pack_compresses_True(self):
+        repo = self.make_repository('repo', format='2a')
+        self.assertTrue(repo._format.pack_compresses)
 
     def test_inventories_use_chk_map_with_parent_base_dict(self):
-        tree = self.make_branch_and_tree('repo', format="development6-rich-root")
+        tree = self.make_branch_and_tree('repo', format="2a")
         revid = tree.commit("foo")
         tree.lock_read()
         self.addCleanup(tree.unlock)
@@ -688,14 +692,33 @@
         self.assertEqual(65536,
             inv.parent_id_basename_to_file_id._root_node.maximum_size)
 
+    def test_pack_with_hint(self):
+        tree = self.make_branch_and_tree('tree', format='2a')
+        # 1 commit to leave untouched
+        tree.commit('1')
+        to_keep = tree.branch.repository._pack_collection.names()
+        # 2 to combine
+        tree.commit('2')
+        tree.commit('3')
+        all = tree.branch.repository._pack_collection.names()
+        combine = list(set(all) - set(to_keep))
+        self.assertLength(3, all)
+        self.assertLength(2, combine)
+        tree.branch.repository.pack(hint=combine)
+        final = tree.branch.repository._pack_collection.names()
+        self.assertLength(2, final)
+        self.assertFalse(combine[0] in final)
+        self.assertFalse(combine[1] in final)
+        self.assertSubset(to_keep, final)
+
     def test_stream_source_to_gc(self):
-        source = self.make_repository('source', format='development6-rich-root')
-        target = self.make_repository('target', format='development6-rich-root')
+        source = self.make_repository('source', format='2a')
+        target = self.make_repository('target', format='2a')
         stream = source._get_source(target._format)
         self.assertIsInstance(stream, groupcompress_repo.GroupCHKStreamSource)
 
     def test_stream_source_to_non_gc(self):
-        source = self.make_repository('source', format='development6-rich-root')
+        source = self.make_repository('source', format='2a')
         target = self.make_repository('target', format='rich-root-pack')
         stream = source._get_source(target._format)
         # We don't want the child GroupCHKStreamSource
@@ -703,7 +726,7 @@
 
     def test_get_stream_for_missing_keys_includes_all_chk_refs(self):
         source_builder = self.make_branch_builder('source',
-                            format='development6-rich-root')
+                            format='2a')
         # We have to build a fairly large tree, so that we are sure the chk
         # pages will have split into multiple pages.
         entries = [('add', ('', 'a-root-id', 'directory', None))]
@@ -726,7 +749,7 @@
         source_branch = source_builder.get_branch()
         source_branch.lock_read()
         self.addCleanup(source_branch.unlock)
-        target = self.make_repository('target', format='development6-rich-root')
+        target = self.make_repository('target', format='2a')
         source = source_branch.repository._get_source(target._format)
         self.assertIsInstance(source, groupcompress_repo.GroupCHKStreamSource)
 
@@ -1354,3 +1377,83 @@
         self.assertTrue(new_pack.inventory_index._optimize_for_size)
         self.assertTrue(new_pack.text_index._optimize_for_size)
         self.assertTrue(new_pack.signature_index._optimize_for_size)
+
+
+class TestCrossFormatPacks(TestCaseWithTransport):
+
+    def log_pack(self, hint=None):
+        self.calls.append(('pack', hint))
+        self.orig_pack(hint=hint)
+        if self.expect_hint:
+            self.assertTrue(hint)
+
+    def run_stream(self, src_fmt, target_fmt, expect_pack_called):
+        self.expect_hint = expect_pack_called
+        self.calls = []
+        source_tree = self.make_branch_and_tree('src', format=src_fmt)
+        source_tree.lock_write()
+        self.addCleanup(source_tree.unlock)
+        tip = source_tree.commit('foo')
+        target = self.make_repository('target', format=target_fmt)
+        target.lock_write()
+        self.addCleanup(target.unlock)
+        source = source_tree.branch.repository._get_source(target._format)
+        self.orig_pack = target.pack
+        target.pack = self.log_pack
+        search = target.search_missing_revision_ids(
+            source_tree.branch.repository, tip)
+        stream = source.get_stream(search)
+        from_format = source_tree.branch.repository._format
+        sink = target._get_sink()
+        sink.insert_stream(stream, from_format, [])
+        if expect_pack_called:
+            self.assertLength(1, self.calls)
+        else:
+            self.assertLength(0, self.calls)
+
+    def run_fetch(self, src_fmt, target_fmt, expect_pack_called):
+        self.expect_hint = expect_pack_called
+        self.calls = []
+        source_tree = self.make_branch_and_tree('src', format=src_fmt)
+        source_tree.lock_write()
+        self.addCleanup(source_tree.unlock)
+        tip = source_tree.commit('foo')
+        target = self.make_repository('target', format=target_fmt)
+        target.lock_write()
+        self.addCleanup(target.unlock)
+        source = source_tree.branch.repository
+        self.orig_pack = target.pack
+        target.pack = self.log_pack
+        target.fetch(source)
+        if expect_pack_called:
+            self.assertLength(1, self.calls)
+        else:
+            self.assertLength(0, self.calls)
+
+    def test_sink_format_hint_no(self):
+        # When the target format says packing makes no difference, pack is not
+        # called.
+        self.run_stream('1.9', 'rich-root-pack', False)
+
+    def test_sink_format_hint_yes(self):
+        # When the target format says packing makes a difference, pack is
+        # called.
+        self.run_stream('1.9', '2a', True)
+
+    def test_sink_format_same_no(self):
+        # When the formats are the same, pack is not called.
+        self.run_stream('2a', '2a', False)
+
+    def test_IDS_format_hint_no(self):
+        # When the target format says packing makes no difference, pack is not
+        # called.
+        self.run_fetch('1.9', 'rich-root-pack', False)
+
+    def test_IDS_format_hint_yes(self):
+        # When the target format says packing makes a difference, pack is
+        # called.
+        self.run_fetch('1.9', '2a', True)
+
+    def test_IDS_format_same_no(self):
+        # When the formats are the same, pack is not called.
+        self.run_fetch('2a', '2a', False)
 