Merge lp:~spiv/bzr/insert-stream-check-chks-part-2 into lp:bzr/2.0
Status: Merged
Merged at revision: not available
Proposed branch: lp:~spiv/bzr/insert-stream-check-chks-part-2
Merge into: lp:bzr/2.0
Diff against target: None lines
To merge this branch: bzr merge lp:~spiv/bzr/insert-stream-check-chks-part-2
Related bugs:
Reviewer: Martin Pool (status: Needs Fixing)
Review via email: mp+11290@code.launchpad.net
Commit message
Description of the change
Andrew Bennetts (spiv) wrote:
Robert Collins (lifeless) wrote:
So a few things.
Firstly, check is trying to move away from raising a single error at the first sign of trouble; raising BzrCheckError in code that check uses takes that back a few steps. You could pass down a flag asking for an exception, or something. This occurs at least twice. I may be misreading, though - the code in question might not be used by check at all.
254 +        repo = self.make_repository('damaged-repo')
255 +        if not repo._format.supports_chks:
256 +            raise TestNotApplicable('requires repository with chk_bytes')
Please move these tests to per_repository_chks
Other than that it looks plausible to me.
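Robert's first point - returning a list of problems, with an optional flag for callers that want fail-fast behaviour - might be sketched like this (illustrative names only, not bzrlib's actual API):

```python
def find_missing_keys(expected, present):
    """Return problems as a list instead of raising at first trouble.

    Returning a list lets 'bzr check' accumulate every problem it finds;
    fail-fast callers can ask for an exception via the flag below.
    """
    return [('missing', key) for key in sorted(expected - present)]


def verify_keys(expected, present, raise_on_error=False):
    # The flag Robert suggests: commit_write_group-style callers can ask
    # for an exception, while check keeps collecting problems.
    problems = find_missing_keys(expected, present)
    if problems and raise_on_error:
        raise ValueError("missing keys: %r" % (problems,))
    return problems
```

With this shape, `verify_keys({('a',), ('b',)}, {('a',)})` simply returns the problem list, and only the `raise_on_error=True` path raises.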
Martin Pool (mbp) wrote:
It seems like you should have a news entry before you merge.
=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py	2009-09-04 07:39:08 +0000
+++ bzrlib/groupcompress.py	2009-09-07 05:52:04 +0000
@@ -1182,6 +1182,12 @@
+    def without_fallbacks(self):
OK, it's fairly obvious, but a docstring would still be good.
+        gcvf = GroupCompressVersionedFiles(
+            self._index, self._access, self._delta)
+        gcvf._unadded_refs = dict(self._unadded_refs)
+        return gcvf
+
It seems a bit potentially problematic to poke this in post
construction, if someone for instance changes the constructor so that it
does actually start opening things. How about instead adding a private
parameter to the constructor to pass this in?
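Martin's alternative might look roughly like this (a sketch only, not bzrlib's actual code; the real constructor takes more arguments, and the `_unadded_refs` parameter is assumed from the diff):

```python
class GroupCompressVersionedFiles(object):
    """Sketch of the suggested private-constructor-parameter approach."""

    def __init__(self, index, access, delta=True, _unadded_refs=None):
        # Private parameter per Martin's suggestion: state is handed over
        # at construction time rather than poked in afterwards, so the
        # clone stays valid even if __init__ later starts opening things.
        self._index = index
        self._access = access
        self._delta = delta
        self._unadded_refs = dict(_unadded_refs or {})
        self._fallback_vfs = []

    def without_fallbacks(self):
        """Return a copy of this object configured with no fallbacks."""
        return GroupCompressVersionedFiles(
            self._index, self._access, self._delta,
            _unadded_refs=self._unadded_refs)
```

The clone then gets its own copy of `_unadded_refs` and an empty fallback list without any post-construction mutation.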
def add_lines(self, key, parents, lines, parent_texts=None,
=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py	2009-09-07 03:00:23 +0000
+++ bzrlib/repofmt/groupcompress_repo.py	2009-09-07 08:39:36 +0000
@@ -591,8 +591,6 @@
        :returns: set of missing keys. Note that not every missing key is
            guaranteed to be reported.
        """
-        if getattr(self.repo, 'chk_bytes', None) is None:
-            return set()
# Ensure that all revisions added in this write group have:
# - corresponding inventories,
# - chk root entries for those inventories,
@@ -603,6 +601,7 @@
+        no_fallback_texts_index = self.repo.texts._index
# Are any inventories for corresponding to the new revisions missing?
@@ -610,7 +609,9 @@
         if missing_corresponding:
-            return [('inventories', key) for key in missing_corresponding]
+            raise errors.BzrCheckError(
+                "Repository %s missing inventories for new revisions %r "
+                % (self.repo, sorted(missing_corresponding)))
As Robert says, it seems like a step backwards to be raising an
exception directly rather than returning a list of problems - is there a
reason why you have to?
# Are any chk root entries missing for any inventories? This includes
# any present parent inventories, which may be used when calculating
# deltas for streaming.
@@ -620,17 +621,57 @@
# Filter out ghost parents.
+            parent_invs_only_keys = all_inv_keys.symmetric_difference(
+                corresponding_invs)
inv_ids = [key[-1] for key in all_inv_keys]
...
Martin Pool (mbp) wrote:
That status is meant to mean 'tweak', but feel free to call me to talk about it.
Preview Diff
1 | === modified file 'bzrlib/groupcompress.py' |
2 | --- bzrlib/groupcompress.py 2009-09-04 07:39:08 +0000 |
3 | +++ bzrlib/groupcompress.py 2009-09-07 05:52:04 +0000 |
4 | @@ -1182,6 +1182,12 @@ |
5 | self._group_cache = LRUSizeCache(max_size=50*1024*1024) |
6 | self._fallback_vfs = [] |
7 | |
8 | + def without_fallbacks(self): |
9 | + gcvf = GroupCompressVersionedFiles( |
10 | + self._index, self._access, self._delta) |
11 | + gcvf._unadded_refs = dict(self._unadded_refs) |
12 | + return gcvf |
13 | + |
14 | def add_lines(self, key, parents, lines, parent_texts=None, |
15 | left_matching_blocks=None, nostore_sha=None, random_id=False, |
16 | check_content=True): |
17 | |
18 | === modified file 'bzrlib/repofmt/groupcompress_repo.py' |
19 | --- bzrlib/repofmt/groupcompress_repo.py 2009-09-07 03:00:23 +0000 |
20 | +++ bzrlib/repofmt/groupcompress_repo.py 2009-09-07 08:39:36 +0000 |
21 | @@ -591,8 +591,6 @@ |
22 | :returns: set of missing keys. Note that not every missing key is |
23 | guaranteed to be reported. |
24 | """ |
25 | - if getattr(self.repo, 'chk_bytes', None) is None: |
26 | - return set() |
27 | # Ensure that all revisions added in this write group have: |
28 | # - corresponding inventories, |
29 | # - chk root entries for those inventories, |
30 | @@ -603,6 +601,7 @@ |
31 | new_revisions_keys = key_deps.get_new_keys() |
32 | no_fallback_inv_index = self.repo.inventories._index |
33 | no_fallback_chk_bytes_index = self.repo.chk_bytes._index |
34 | + no_fallback_texts_index = self.repo.texts._index |
35 | inv_parent_map = no_fallback_inv_index.get_parent_map( |
36 | new_revisions_keys) |
37 | # Are any inventories for corresponding to the new revisions missing? |
38 | @@ -610,7 +609,9 @@ |
39 | missing_corresponding = set(new_revisions_keys) |
40 | missing_corresponding.difference_update(corresponding_invs) |
41 | if missing_corresponding: |
42 | - return [('inventories', key) for key in missing_corresponding] |
43 | + raise errors.BzrCheckError( |
44 | + "Repository %s missing inventories for new revisions %r " |
45 | + % (self.repo, sorted(missing_corresponding))) |
46 | # Are any chk root entries missing for any inventories? This includes |
47 | # any present parent inventories, which may be used when calculating |
48 | # deltas for streaming. |
49 | @@ -620,17 +621,57 @@ |
50 | # Filter out ghost parents. |
51 | all_inv_keys.intersection_update( |
52 | no_fallback_inv_index.get_parent_map(all_inv_keys)) |
53 | + parent_invs_only_keys = all_inv_keys.symmetric_difference( |
54 | + corresponding_invs) |
55 | all_missing = set() |
56 | inv_ids = [key[-1] for key in all_inv_keys] |
57 | - for inv in self.repo.iter_inventories(inv_ids, 'unordered'): |
58 | - root_keys = set([inv.id_to_entry.key()]) |
59 | - if inv.parent_id_basename_to_file_id is not None: |
60 | - root_keys.add(inv.parent_id_basename_to_file_id.key()) |
61 | - present = no_fallback_chk_bytes_index.get_parent_map(root_keys) |
62 | - missing = root_keys.difference(present) |
63 | - all_missing.update([('chk_bytes',) + key for key in missing]) |
64 | - return all_missing |
65 | - |
66 | + parent_invs_only_ids = [key[-1] for key in parent_invs_only_keys] |
67 | + root_key_info = _build_interesting_key_sets( |
68 | + self.repo, inv_ids, parent_invs_only_ids) |
69 | + expected_chk_roots = root_key_info.all_keys() |
70 | + present_chk_roots = no_fallback_chk_bytes_index.get_parent_map( |
71 | + expected_chk_roots) |
72 | + missing_chk_roots = expected_chk_roots.difference(present_chk_roots) |
73 | + if missing_chk_roots: |
74 | + # Don't bother checking any further. |
75 | + raise errors.BzrCheckError( |
76 | + "Repository %s missing chk root keys %r for new revisions" |
77 | + % (self.repo, sorted(missing_chk_roots))) |
78 | + # Find all interesting chk_bytes records, and make sure they are |
79 | + # present, as well as the text keys they reference. |
80 | + chk_bytes_no_fallbacks = self.repo.chk_bytes.without_fallbacks() |
81 | + chk_bytes_no_fallbacks._search_key_func = \ |
82 | + self.repo.chk_bytes._search_key_func |
83 | + chk_diff = chk_map.iter_interesting_nodes( |
84 | + chk_bytes_no_fallbacks, root_key_info.interesting_root_keys, |
85 | + root_key_info.uninteresting_root_keys) |
86 | + bytes_to_info = inventory.CHKInventory._bytes_to_utf8name_key |
87 | + text_keys = set() |
88 | + try: |
89 | + for record in _filter_text_keys(chk_diff, text_keys, bytes_to_info): |
90 | + pass |
91 | + except errors.NoSuchRevision, e: |
92 | + # XXX: It would be nice if we could give a more precise error here. |
93 | + raise errors.BzrCheckError( |
94 | + "Repository %s missing chk node(s) for new revisions." |
95 | + % (self.repo,)) |
96 | + chk_diff = chk_map.iter_interesting_nodes( |
97 | + chk_bytes_no_fallbacks, root_key_info.interesting_pid_root_keys, |
98 | + root_key_info.uninteresting_pid_root_keys) |
99 | + try: |
100 | + for interesting_rec, interesting_map in chk_diff: |
101 | + pass |
102 | + except errors.NoSuchRevision, e: |
103 | + raise errors.BzrCheckError( |
104 | + "Repository %s missing chk node(s) for new revisions." |
105 | + % (self.repo,)) |
106 | + present_text_keys = no_fallback_texts_index.get_parent_map(text_keys) |
107 | + missing_text_keys = text_keys.difference(present_text_keys) |
108 | + if missing_text_keys: |
109 | + raise errors.BzrCheckError( |
110 | + "Repository %s missing text keys %r for new revisions" |
111 | + % (self.repo, sorted(missing_text_keys))) |
112 | + |
113 | def _execute_pack_operations(self, pack_operations, |
114 | _packer_class=GCCHKPacker, |
115 | reload_func=None): |
116 | @@ -898,17 +939,12 @@ |
117 | parent_keys) |
118 | present_parent_inv_ids = set( |
119 | [k[-1] for k in present_parent_inv_keys]) |
120 | - uninteresting_root_keys = set() |
121 | - interesting_root_keys = set() |
122 | inventories_to_read = set(revision_ids) |
123 | inventories_to_read.update(present_parent_inv_ids) |
124 | - for inv in self.iter_inventories(inventories_to_read): |
125 | - entry_chk_root_key = inv.id_to_entry.key() |
126 | - if inv.revision_id in present_parent_inv_ids: |
127 | - uninteresting_root_keys.add(entry_chk_root_key) |
128 | - else: |
129 | - interesting_root_keys.add(entry_chk_root_key) |
130 | - |
131 | + root_key_info = _build_interesting_key_sets( |
132 | + self, inventories_to_read, present_parent_inv_ids) |
133 | + interesting_root_keys = root_key_info.interesting_root_keys |
134 | + uninteresting_root_keys = root_key_info.uninteresting_root_keys |
135 | chk_bytes = self.chk_bytes |
136 | for record, items in chk_map.iter_interesting_nodes(chk_bytes, |
137 | interesting_root_keys, uninteresting_root_keys, |
138 | @@ -1048,13 +1084,10 @@ |
139 | bytes_to_info = inventory.CHKInventory._bytes_to_utf8name_key |
140 | chk_bytes = self.from_repository.chk_bytes |
141 | def _filter_id_to_entry(): |
142 | - for record, items in chk_map.iter_interesting_nodes(chk_bytes, |
143 | - self._chk_id_roots, uninteresting_root_keys): |
144 | - for name, bytes in items: |
145 | - # Note: we don't care about name_utf8, because we are always |
146 | - # rich-root = True |
147 | - _, file_id, revision_id = bytes_to_info(bytes) |
148 | - self._text_keys.add((file_id, revision_id)) |
149 | + interesting_nodes = chk_map.iter_interesting_nodes(chk_bytes, |
150 | + self._chk_id_roots, uninteresting_root_keys) |
151 | + for record in _filter_text_keys(interesting_nodes, self._text_keys, |
152 | + bytes_to_info): |
153 | if record is not None: |
154 | yield record |
155 | # Consumed |
156 | @@ -1098,7 +1131,7 @@ |
157 | missing_inventory_keys.add(key[1:]) |
158 | if self._chk_id_roots or self._chk_p_id_roots: |
159 | raise AssertionError('Cannot call get_stream_for_missing_keys' |
160 | - ' untill all of get_stream() has been consumed.') |
161 | + ' until all of get_stream() has been consumed.') |
162 | # Yield the inventory stream, so we can find the chk stream |
163 | # Some of the missing_keys will be missing because they are ghosts. |
164 | # As such, we can ignore them. The Sink is required to verify there are |
165 | @@ -1111,6 +1144,54 @@ |
166 | yield stream_info |
167 | |
168 | |
169 | +class _InterestingKeyInfo(object): |
170 | + def __init__(self): |
171 | + self.interesting_root_keys = set() |
172 | + self.interesting_pid_root_keys = set() |
173 | + self.uninteresting_root_keys = set() |
174 | + self.uninteresting_pid_root_keys = set() |
175 | + |
176 | + def all_interesting(self): |
177 | + return self.interesting_root_keys.union(self.interesting_pid_root_keys) |
178 | + |
179 | + def all_uninteresting(self): |
180 | + return self.uninteresting_root_keys.union( |
181 | + self.uninteresting_pid_root_keys) |
182 | + |
183 | + def all_keys(self): |
184 | + return self.all_interesting().union(self.all_uninteresting()) |
185 | + |
186 | + |
187 | +def _build_interesting_key_sets(repo, inventory_ids, parent_only_inv_ids): |
188 | + result = _InterestingKeyInfo() |
189 | + for inv in repo.iter_inventories(inventory_ids, 'unordered'): |
190 | + root_key = inv.id_to_entry.key() |
191 | + pid_root_key = inv.parent_id_basename_to_file_id.key() |
192 | + if inv.revision_id in parent_only_inv_ids: |
193 | + result.uninteresting_root_keys.add(root_key) |
194 | + result.uninteresting_pid_root_keys.add(pid_root_key) |
195 | + else: |
196 | + result.interesting_root_keys.add(root_key) |
197 | + result.interesting_pid_root_keys.add(pid_root_key) |
198 | + return result |
199 | + |
200 | + |
201 | +def _filter_text_keys(interesting_nodes_iterable, text_keys, bytes_to_info): |
202 | + """Iterate the result of iter_interesting_nodes, yielding the records |
203 | + and adding to text_keys. |
204 | + """ |
205 | + for record, items in interesting_nodes_iterable: |
206 | + for name, bytes in items: |
207 | + # Note: we don't care about name_utf8, because groupcompress repos |
208 | + # are always rich-root, so there are no synthesised root records to |
209 | + # ignore. |
210 | + _, file_id, revision_id = bytes_to_info(bytes) |
211 | + text_keys.add((file_id, revision_id)) |
212 | + yield record |
213 | + |
214 | + |
215 | + |
216 | + |
217 | class RepositoryFormatCHK1(RepositoryFormatPack): |
218 | """A hashed CHK+group compress pack repository.""" |
219 | |
220 | |
221 | === modified file 'bzrlib/repofmt/pack_repo.py' |
222 | --- bzrlib/repofmt/pack_repo.py 2009-09-07 03:00:23 +0000 |
223 | +++ bzrlib/repofmt/pack_repo.py 2009-09-07 06:01:48 +0000 |
224 | @@ -2071,7 +2071,7 @@ |
225 | """ |
226 | # The base implementation does no checks. GCRepositoryPackCollection |
227 | # overrides this. |
228 | - return set() |
229 | + pass |
230 | |
231 | def _commit_write_group(self): |
232 | all_missing = set() |
233 | @@ -2087,11 +2087,7 @@ |
234 | raise errors.BzrCheckError( |
235 | "Repository %s has missing compression parent(s) %r " |
236 | % (self.repo, sorted(all_missing))) |
237 | - all_missing = self._check_new_inventories() |
238 | - if all_missing: |
239 | - raise errors.BzrCheckError( |
240 | - "Repository %s missing keys for new revisions %r " |
241 | - % (self.repo, sorted(all_missing))) |
242 | + self._check_new_inventories() |
243 | self._remove_pack_indices(self._new_pack) |
244 | any_new_content = False |
245 | if self._new_pack.data_inserted(): |
246 | |
247 | === modified file 'bzrlib/tests/per_repository/test_write_group.py' |
248 | --- bzrlib/tests/per_repository/test_write_group.py 2009-09-02 03:07:23 +0000 |
249 | +++ bzrlib/tests/per_repository/test_write_group.py 2009-09-07 08:06:25 +0000 |
250 | @@ -365,16 +365,16 @@ |
251 | """commit_write_group fails with BzrCheckError when the chk root record |
252 | for a new inventory is missing. |
253 | """ |
254 | + repo = self.make_repository('damaged-repo') |
255 | + if not repo._format.supports_chks: |
256 | + raise TestNotApplicable('requires repository with chk_bytes') |
257 | builder = self.make_branch_builder('simple-branch') |
258 | builder.build_snapshot('A-id', None, [ |
259 | ('add', ('', 'root-id', 'directory', None)), |
260 | ('add', ('file', 'file-id', 'file', 'content\n'))]) |
261 | b = builder.get_branch() |
262 | - if not b.repository._format.supports_chks: |
263 | - raise TestNotApplicable('requires repository with chk_bytes') |
264 | b.lock_read() |
265 | self.addCleanup(b.unlock) |
266 | - repo = self.make_repository('damaged-repo') |
267 | repo.lock_write() |
268 | repo.start_write_group() |
269 | # Now, add the objects manually |
270 | @@ -411,6 +411,9 @@ |
271 | (In principle the chk records are unnecessary in this case, but in |
272 | practice bzr 2.0rc1 (at least) expects to find them.) |
273 | """ |
274 | + repo = self.make_repository('damaged-repo') |
275 | + if not repo._format.supports_chks: |
276 | + raise TestNotApplicable('requires repository with chk_bytes') |
277 | # Make a branch where the last two revisions have identical |
278 | # inventories. |
279 | builder = self.make_branch_builder('simple-branch') |
280 | @@ -420,8 +423,6 @@ |
281 | builder.build_snapshot('B-id', None, []) |
282 | builder.build_snapshot('C-id', None, []) |
283 | b = builder.get_branch() |
284 | - if not b.repository._format.supports_chks: |
285 | - raise TestNotApplicable('requires repository with chk_bytes') |
286 | b.lock_read() |
287 | self.addCleanup(b.unlock) |
288 | # check our setup: B-id and C-id should have identical chk root keys. |
289 | @@ -433,10 +434,71 @@ |
290 | # We need ('revisions', 'C-id'), ('inventories', 'C-id'), |
291 | # ('inventories', 'B-id'), and the corresponding chk roots for those |
292 | # inventories. |
293 | + repo.lock_write() |
294 | + repo.start_write_group() |
295 | + src_repo = b.repository |
296 | + repo.inventories.insert_record_stream( |
297 | + src_repo.inventories.get_record_stream( |
298 | + [('B-id',), ('C-id',)], 'unordered', True)) |
299 | + repo.revisions.insert_record_stream( |
300 | + src_repo.revisions.get_record_stream( |
301 | + [('C-id',)], 'unordered', True)) |
302 | + # Make sure the presence of the missing data in a fallback does not |
303 | + # avoid the error. |
304 | + repo.add_fallback_repository(b.repository) |
305 | + self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
306 | + reopened_repo = self.reopen_repo_and_resume_write_group(repo) |
307 | + self.assertRaises( |
308 | + errors.BzrCheckError, reopened_repo.commit_write_group) |
309 | + reopened_repo.abort_write_group() |
310 | + |
311 | + def test_missing_chk_leaf_for_inventory(self): |
312 | + """commit_write_group fails with BzrCheckError when the chk root record |
313 | + for a parent inventory of a new revision is missing. |
314 | + """ |
315 | repo = self.make_repository('damaged-repo') |
316 | + if not repo._format.supports_chks: |
317 | + raise TestNotApplicable('requires repository with chk_bytes') |
318 | + builder = self.make_branch_builder('simple-branch') |
319 | + # add and modify files with very long file-ids, so that the chk map |
320 | + # will need more than just a root node. |
321 | + file_adds = [] |
322 | + file_modifies = [] |
323 | + for char in 'abc': |
324 | + name = char * 10000 |
325 | + file_adds.append( |
326 | + ('add', ('file-' + name, 'file-%s-id' % name, 'file', |
327 | + 'content %s\n' % name))) |
328 | + file_modifies.append( |
329 | + ('modify', ('file-%s-id' % name, 'new content %s\n' % name))) |
330 | + builder.build_snapshot('A-id', None, [ |
331 | + ('add', ('', 'root-id', 'directory', None))] + |
332 | + file_adds) |
333 | + builder.build_snapshot('B-id', None, []) |
334 | + builder.build_snapshot('C-id', None, file_modifies) |
335 | + b = builder.get_branch() |
336 | + src_repo = b.repository |
337 | + src_repo.lock_read() |
338 | + self.addCleanup(src_repo.unlock) |
339 | + # Now, manually insert objects for a stacked repo with only revision |
340 | + # C-id, *except* drop the non-root chk records. |
341 | + inv_b = src_repo.get_inventory('B-id') |
342 | + inv_c = src_repo.get_inventory('C-id') |
343 | + chk_root_keys_only = [ |
344 | + inv_b.id_to_entry.key(), inv_b.parent_id_basename_to_file_id.key(), |
345 | + inv_c.id_to_entry.key(), inv_c.parent_id_basename_to_file_id.key()] |
346 | + all_chks = src_repo.chk_bytes.keys() |
347 | + # Pick a non-root key to drop |
348 | + key_to_drop = all_chks.difference(chk_root_keys_only).pop() |
349 | + all_chks.discard(key_to_drop) |
350 | repo.lock_write() |
351 | repo.start_write_group() |
352 | - src_repo = b.repository |
353 | + repo.chk_bytes.insert_record_stream( |
354 | + src_repo.chk_bytes.get_record_stream( |
355 | + all_chks, 'unordered', True)) |
356 | + repo.texts.insert_record_stream( |
357 | + src_repo.texts.get_record_stream( |
358 | + src_repo.texts.keys(), 'unordered', True)) |
359 | repo.inventories.insert_record_stream( |
360 | src_repo.inventories.get_record_stream( |
361 | [('B-id',), ('C-id',)], 'unordered', True)) |
362 | @@ -456,16 +518,10 @@ |
363 | """commit_write_group fails with BzrCheckError when the chk root record |
364 | for a parent inventory of a new revision is missing. |
365 | """ |
366 | - builder = self.make_branch_builder('simple-branch') |
367 | - builder.build_snapshot('A-id', None, [ |
368 | - ('add', ('', 'root-id', 'directory', None)), |
369 | - ('add', ('file', 'file-id', 'file', 'content\n'))]) |
370 | - builder.build_snapshot('B-id', None, []) |
371 | - builder.build_snapshot('C-id', None, [ |
372 | - ('modify', ('file-id', 'new-content'))]) |
373 | - b = builder.get_branch() |
374 | - if not b.repository._format.supports_chks: |
375 | + repo = self.make_repository('damaged-repo') |
376 | + if not repo._format.supports_chks: |
377 | raise TestNotApplicable('requires repository with chk_bytes') |
378 | + b = self.make_branch_with_multiple_chk_nodes() |
379 | b.lock_read() |
380 | self.addCleanup(b.unlock) |
381 | # Now, manually insert objects for a stacked repo with only revision |
382 | @@ -476,7 +532,6 @@ |
383 | inv_c = b.repository.get_inventory('C-id') |
384 | chk_keys_for_c_only = [ |
385 | inv_c.id_to_entry.key(), inv_c.parent_id_basename_to_file_id.key()] |
386 | - repo = self.make_repository('damaged-repo') |
387 | repo.lock_write() |
388 | repo.start_write_group() |
389 | src_repo = b.repository |
390 | @@ -498,6 +553,63 @@ |
391 | errors.BzrCheckError, reopened_repo.commit_write_group) |
392 | reopened_repo.abort_write_group() |
393 | |
394 | + def make_branch_with_multiple_chk_nodes(self): |
395 | + # add and modify files with very long file-ids, so that the chk map |
396 | + # will need more than just a root node. |
397 | + builder = self.make_branch_builder('simple-branch') |
398 | + file_adds = [] |
399 | + file_modifies = [] |
400 | + for char in 'abc': |
401 | + name = char * 10000 |
402 | + file_adds.append( |
403 | + ('add', ('file-' + name, 'file-%s-id' % name, 'file', |
404 | + 'content %s\n' % name))) |
405 | + file_modifies.append( |
406 | + ('modify', ('file-%s-id' % name, 'new content %s\n' % name))) |
407 | + builder.build_snapshot('A-id', None, [ |
408 | + ('add', ('', 'root-id', 'directory', None))] + |
409 | + file_adds) |
410 | + builder.build_snapshot('B-id', None, []) |
411 | + builder.build_snapshot('C-id', None, file_modifies) |
412 | + return builder.get_branch() |
413 | + |
414 | + def test_missing_text_record(self): |
415 | + """commit_write_group fails with BzrCheckError when a text is missing. |
416 | + """ |
417 | + repo = self.make_repository('damaged-repo') |
418 | + if not repo._format.supports_chks: |
419 | + raise TestNotApplicable('requires repository with chk_bytes') |
420 | + b = self.make_branch_with_multiple_chk_nodes() |
421 | + src_repo = b.repository |
422 | + src_repo.lock_read() |
423 | + self.addCleanup(src_repo.unlock) |
424 | + # Now, manually insert objects for a stacked repo with only revision |
425 | + # C-id, *except* drop one changed text. |
426 | + all_texts = src_repo.texts.keys() |
427 | + all_texts.remove(('file-%s-id' % ('c'*10000,), 'C-id')) |
428 | + repo.lock_write() |
429 | + repo.start_write_group() |
430 | + repo.chk_bytes.insert_record_stream( |
431 | + src_repo.chk_bytes.get_record_stream( |
432 | + src_repo.chk_bytes.keys(), 'unordered', True)) |
433 | + repo.texts.insert_record_stream( |
434 | + src_repo.texts.get_record_stream( |
435 | + all_texts, 'unordered', True)) |
436 | + repo.inventories.insert_record_stream( |
437 | + src_repo.inventories.get_record_stream( |
438 | + [('B-id',), ('C-id',)], 'unordered', True)) |
439 | + repo.revisions.insert_record_stream( |
440 | + src_repo.revisions.get_record_stream( |
441 | + [('C-id',)], 'unordered', True)) |
442 | + # Make sure the presence of the missing data in a fallback does not |
443 | + # avoid the error. |
444 | + repo.add_fallback_repository(b.repository) |
445 | + self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
446 | + reopened_repo = self.reopen_repo_and_resume_write_group(repo) |
447 | + self.assertRaises( |
448 | + errors.BzrCheckError, reopened_repo.commit_write_group) |
449 | + reopened_repo.abort_write_group() |
450 | + |
451 | |
452 | class TestResumeableWriteGroup(TestCaseWithRepository): |
453 |
This completes the fix for bug 406687.
We could go much further, I think, in terms of reusing code. I've taken a nibble at that in this branch, but there's room to do much, much better. We now have three code paths in groupcompress_repo.py that are doing essentially the same logic of finding text references from a set of revisions:
- GroupCHKStreamSource
- fileids_altered_by_revision_ids
- _check_new_inventories (the new checks in this patch)
However, I think the incremental improvement in this patch is landable without tackling all the duplication at once. (But I'm happy to do any more low-hanging fruit that a reviewer considers appropriate.)
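The shared logic those three paths repeat could eventually be factored into one helper along these lines (purely illustrative; the real code walks CHK maps via chk_map.iter_interesting_nodes, and `bytes_to_key` here stands in for CHKInventory._bytes_to_utf8name_key):

```python
def iter_text_keys(leaf_items, bytes_to_key):
    """Yield (file_id, revision_id) text keys from CHK leaf items.

    A single shared generator like this could serve all three callers,
    instead of each one reimplementing the extraction loop.
    """
    for name, value in leaf_items:
        # bytes_to_key is assumed to return (name_utf8, file_id,
        # revision_id); we only care about the text key itself.
        _, file_id, revision_id = bytes_to_key(value)
        yield (file_id, revision_id)
```

Each caller would then only differ in how it obtains the leaf items and what it does with the resulting text keys.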