Merge into trunk-old : faster-commit-file : Code : Bazaar

Reviewer	Review Type	Date Requested	Status
Robert Collins (community)		2009-06-25	Needs Fixing on 2009-07-02
Review via email: mp+7885@code.launchpad.net

Ian Clatworthy (ian-clatworthy) wrote on 2009-05-12: Posted in a previous version of this proposal

This branch makes commit of selected files on development6-rich-root faster than it is on 1.9 format, instead of being twice as slow. All tests still pass and the code changes are pretty simple so I think it is safe for 1.15rc.

It's possible to get further improvements here by pushing more complexity down into iter_changes as bug #347649 suggests. Given selective commit performance is down to 0.27 seconds for Emacs with 3K files and 105K revisions, I'm not sure that's necessary yet. That path is certainly a lot more work than this patch, given the multiple implementations of iter_changes() around the place and the tuning that has gone into them to date.

Robert Collins (lifeless) wrote on 2009-05-12: Posted in a previous version of this proposal

On Tue, 2009-05-12 at 14:05 +0000, Ian Clatworthy wrote:
> Ian Clatworthy has proposed merging lp:~bzr/bzr/faster-commit-file
> into lp:bzr.
>
> Requested reviews:
> Bazaar Developers (bzr)
>
> This branch makes commit of selected files on development6-rich-root
> faster than it is on 1.9 format, instead of being twice as slow. All
> tests still pass and the code changes are pretty simple so I think it
> is safe for 1.15rc.
>
> It's possible to get further improvements here by pushing more
> complexity down into iter_changes as bug #347649 suggests. Given
> selective commit performance is down to 0.27 seconds for Emacs with 3K
> files and 105K revisions, I'm not sure that's necessary yet. That path
> is certainly a lot more work than this patch, given the multiple
> implementations of iter_changes() around the place and the tuning that
> has gone into them to date.

This feels like a bit of a bandaid, and I don't think that the lack of
failures is an indication of correctness - there weren't tests for the
failures that bug 347649 refers to. For instance, simple inspection of
the patch suggests to me that you can miss the root id - e.g. a
pathological case would be a selective first commit with the root
skipped.

I appreciate not wanting to monkey with iter_changes, but the problem
with doing this just up at the commit level is that you have to black
box test. I don't think it's safe to put these particular changes into
commit without specific tests.

If you were to put the masking support into iter_changes as an optional
decorating generator, you could test iter_changes quite easily
(using the intertree tests) - if it's fast enough for commit, then as you
say it's likely fast enough in general, and by preserving the layering we
can optimise later. And unit tests for iter_changes will actually speak
to the core issue rather than circumstance, which is the most a test on
commit itself could do.

e.g.
def iter_changes_exclude_paths(changes, paths):
    for change in changes:
        # change[1] is the (old_path, new_path) pair
        if is_inside_any(paths, change[1][1]):
            continue
        yield change

def iter_changes_include_added_parents(changes, specific_paths):
....

def iter_changes(....)
    result = self._iter_changes(...)
    if specific:
        result = iter_changes_include_added_parents(result, foo)
    if exclude:
        result = iter_changes_exclude_paths(result, foo)
    return result

review disapprove

-Rob
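
The layering suggested above can be sketched as stand-alone Python. This is only an illustration under stated assumptions: each change is reduced to a `(file_id, (old_path, new_path))` tuple, `is_inside_any` is a simplified local stand-in for the bzrlib helper of the same name, and `filter_changes` is a hypothetical entry point, not an existing bzrlib function.

```python
def is_inside(directory, path):
    """True if path is the directory itself or lies beneath it ('' is the root)."""
    if directory == '' or directory == path:
        return True
    return path.startswith(directory.rstrip('/') + '/')


def is_inside_any(directories, path):
    """Simplified stand-in for the bzrlib helper of the same name."""
    return any(is_inside(d, path) for d in directories)


def iter_changes_exclude_paths(changes, paths):
    """Decorating generator: drop changes whose path is excluded."""
    for change in changes:
        old_path, new_path = change[1]
        path = new_path if new_path is not None else old_path
        if is_inside_any(paths, path):
            continue
        yield change


def filter_changes(changes, exclude=None):
    """Hypothetical entry point: only wrap the iterator when asked to."""
    result = iter(changes)
    if exclude:
        result = iter_changes_exclude_paths(result, exclude)
    return result


changes = [
    ('root-id', (None, '')),
    ('readme-id', (None, 'README')),
    ('doc-id', (None, 'doc')),
    ('doc-index-id', (None, 'doc/index.txt')),
]
kept = [c[0] for c in filter_changes(changes, exclude=['doc'])]
print(kept)
```

Because each filter is an ordinary generator over the change tuples, it can be unit tested with hand-built lists like `changes` above, without creating a working tree or running a commit.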

review: Disapprove

Ian Clatworthy (ian-clatworthy) wrote on 2009-05-13: Posted in a previous version of this proposal

Robert Collins wrote:
> Review: Disapprove
> On Tue, 2009-05-12 at 14:05 +0000, Ian Clatworthy wrote:
>   
>> It's possible to get further improvements here by pushing more
>> complexity down into iter_changes as bug #347649 suggests. Given
>> selective commit performance is down to 0.27 seconds for Emacs with 3K
>> files and 105K revisions, I'm not sure that's necessary yet. That path
>> is certainly a lot more work than this patch, given the multiple
>> implementations of iter_changes() around the place and the tuning that
>> has gone into them to date.
>>     
>
> This feels like a bit of a bandaid, and I don't think that the lack of
> failures is an indication of correctness - there weren't tests for the
> failures that bug 347649 refers to. For instance, simple inspection of
> the patch suggests to me that you can miss the root id - e.g. a
> pathological case would be a selective first commit with the root
> skipped.
>
>   
I'd expect bzr commit -x '' to be a pointless commit. I suspect my code
will error in exactly that way. It's a *very* pathological case as well:
no sane person would do that on the command line (though it may happen
via the API).
> I appreciate not wanting to monkey with iter_changes, but the problem
> with doing this just up at the commit level is that you have to black
> box test. I don't think its safe to put these particular changes into
> commit without specific tests.
>
>   
"bzr selftest commit" runs 2335 tests so I think its test coverage is
pretty comprehensive. Lots of tests broke during the course of putting
this patch together fwiw, e.g. simply passing specific_files to
iter_changes breaks the test suite so bug 347649 is indeed trapped by tests.

> If you were to put the masking support into iter-changes as a optional
> decorating generator (*)you could using test iter-changes quite easily
> (using the intertree tests) - if its fast enough for commit, then as you
> say its likely fast enough in general, and by preserving the layering we
> can optimise later. And unit tests for iter_changes will actually speak
> to the core issue rather than circumstance, which is the most a test on
> commit itself could do.
>
> e.g.
> def iter_changes_exclude_paths(changes, paths)
>     for change in changes:
>         if change[0][1] is_inside_any(paths):
>             continue
>     yield change
>
> def iter_changes_include_added_parents(changes, specific_paths):
>    ....
>
>
> def iter_changes(....)
>     result = self._iter_changes(...)
>     if specific:
>         result = iter_changes_include_added_parents(result, foo)
>     if exclude:
>         result = iter_changes_exclude_paths(result, foo)
>     return result
>
>   
That still implies adding 2 new options to *multiple* implementations of
iter_changes: yield_parents and exclude_files, say. That's a far bigger
change than this for no additional performance gain. Longer term, I'd be
surprised if pushing exclude_files down could gain us much. Yielding
parents in-situ could be *slightly* faster though but it would take a
large delta before any speed difference was noticed I suspect.
>  review disapprove
This patch is an interim step - we really want to leverage the
specific_files masking at the layer below so someone running 'bzr commit
foo/bar' on the whole Debian tree gets a fast response that this patch
delivers. Having said that, I don't think this patch is a band-aid and
it meets all our criteria for approval: code quality is good, test
coverage is not reduced and performance is improved by a factor of 4(!!)
on a common operation. I think it ought to be landed as a step forward
while I work on a faster solution. But that's just me.

Ian C.

Robert Collins (lifeless) wrote on 2009-05-13: Posted in a previous version of this proposal

On Wed, 2009-05-13 at 03:33 +0000, Ian Clatworthy wrote:
>
> I'd expect bzr commit -x '' to be a pointless commit. I suspect my code
> will error in exactly that way. It's a *very* pathological case as well:
> no sane person would do that on the command line (though it may happen
> via the API).

That's not what I mean. What I mean is
bzr init
bzr add *
bzr commit -m 'foo' README
-> bad data.

> > I appreciate not wanting to monkey with iter_changes, but the problem
> > with doing this just up at the commit level is that you have to black
> > box test. I don't think its safe to put these particular changes into
> > commit without specific tests.
> >
> "bzr selftest commit" runs 2335 tests so I think it's test coverage is
> pretty comprehensive. Lots of tests broke during the course of putting
> this patch together fwiw, e.g. simply passing specific_files to
> iter_changes breaks the test suite so bug 347649 is indeed trapped by
> tests.

That count includes all the micro tests that test specific behaviours of
commit builder. I'm glad that we have accidental coverage of the bug;
I'm still very sure we don't have explicit coverage.

> That still implies adding 2 new options to *multiple* implementations of
> iter_changes: yield_parents and exclude_files, say. That's a far bigger
> change than this for no additional performance gain. Longer term, I'd be
> surprised if pushing exclude_files down could gain us much.

We have the concept that 'diff' and 'status' show us what 'commit' will
do. Until we have commit building on the same logic we'll struggle to
make that a reality. That is another thing that pushing the logic down
will help with.

> Yielding parents in-situ could be *slightly* faster though but it would
> take a large delta before any speed difference was noticed I suspect.
> > review disapprove

> This patch is an interim step - we really want to leverage the
> specific_files masking at the layer below so someone running 'bzr commit
> foo/bar' on the whole Debian tree gets a fast response that this patch
> delivers. Having said that, I don't think this patch is a band-aid and
> it meets all our criteria for approval: code quality is good, test
> coverage is not reduced and performance is improved by a factor of 4(!!)
> on a common operation. I think it ought to be landed as a step forward
> while I work on a faster solution. But that's just me.

Another factor, now I think about this, is people using
record_iter_changes. The caveats for using it are not really clear, and
having code in commit.py to work around limits in iter_changes won't help
with that.

I can be persuaded that we should land this as is, but:
- it definitely does not fix the bug report, because the bug is about
   iter_changes
- I think it's only a couple of hours' work to do it right, within the
   constraints of 'get the interface solid and tested'. I don't see any
   benefit to not doing the work now, while it's paged in and in your
   mind.

-Rob

Ian Clatworthy (ian-clatworthy) wrote on 2009-05-13: Posted in a previous version of this proposal

Resubmitted along the lines discussed on IRC.

Robert Collins (lifeless) wrote on 2009-05-15: Posted in a previous version of this proposal

I've taken a diff against bzr.dev  - review follows. Look for ****
introducing my comments
 
=== modified file 'bzrlib/commit.py'
--- bzrlib/commit.py	2009-04-03 00:07:49 +0000
+++ bzrlib/commit.py	2009-05-15 02:31:26 +0000
@@ -289,12 +289,14 @@
         # We can use record_iter_changes IFF iter_changes is compatible with
         # the command line parameters, and the repository has fast delta
         # generation. See bug 347649.
-        self.use_record_iter_changes = (
-            not self.specific_files and
-            not self.exclude and 
-            not self.branch.repository._format.supports_tree_reference and
-            (self.branch.repository._format.fast_deltas or
-             len(self.parents) < 2))
+        if self.branch.repository._format.supports_tree_reference:
+            # TODO: fix this
+            self.use_record_iter_changes = False
+        else:
+            self.use_record_iter_changes = (
+                self.branch.repository._format.fast_deltas
+                or (len(self.parents) < 2 and not self.specific_files
+                and not self.exclude))

**** You've altered this if block substantially, and I think
unintentionally. Please just use the old phrase and remove the two lines
that you're addressing. *OR* update the comment to match what your new
if:else: actually does.
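
To see how far the rewritten condition departs from the old one, the two expressions in the hunk above can be compared over every input combination. This is a sketch only: the function and parameter names are simplifications of the attributes in the diff (`tree_reference` stands for `supports_tree_reference`, `n_parents` for `len(self.parents)`), and it says nothing about which behaviour was intended.

```python
from itertools import product


def old_rule(specific_files, exclude, tree_reference, fast_deltas, n_parents):
    """Condition removed by the patch (the '-' lines)."""
    return bool(not specific_files and not exclude and not tree_reference
                and (fast_deltas or n_parents < 2))


def new_rule(specific_files, exclude, tree_reference, fast_deltas, n_parents):
    """Condition added by the patch (the '+' lines)."""
    if tree_reference:
        return False
    return bool(fast_deltas
                or (n_parents < 2 and not specific_files and not exclude))


# Collect every combination on which the two rules disagree.
differences = [
    combo
    for combo in product([False, True], [False, True], [False, True],
                         [False, True], [1, 2])
    if old_rule(*combo) != new_rule(*combo)
]
for combo in differences:
    print(combo, '-> old:', old_rule(*combo), 'new:', new_rule(*combo))
```

The rules disagree only when the repository has fast deltas and specific or excluded files were given: the new rule then uses record_iter_changes where the old one did not, which the unchanged comment about being "compatible with the command line parameters" no longer describes.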

@@ -632,13 +634,16 @@
     def _update_builder_with_changes(self):
         """Update the commit builder with the data about what has changed.
         """
-        exclude = self.exclude
-        specific_files = self.specific_files or []
-        mutter("Selecting files for commit with filter %s", specific_files)
+        mutter("Selecting files for commit with filter %s excluding %s",
+            self.specific_files, self.exclude)
 
         self._check_strict()
         if self.use_record_iter_changes:
             iter_changes = self.work_tree.iter_changes(self.basis_tree)
+            if self.specific_files or self.exclude:
+                iter_changes = tree.filter_iter_changes_by_paths(iter_changes,
+                    self.specific_files, self.exclude,
+                    yield_changed_parents=True)

**** I don't think a flag for 'yield_changed_parents' is needed; it
smells of YAGNI to me: it's a new interface, and our only user wants
yield-changed-parents always on.

@@ -955,17 +959,16 @@
 
     def _set_specific_file_ids(self):
         """populate self.specific_file_ids if we will use it."""
-        if not self.use_record_iter_changes:
-            # If provided, ensure the specified files are versioned
-            if self.specific_files is not None:
-                # Note: This routine is being called because it raises
-                # PathNotVersionedError as a side effect of finding the IDs. We
-                # later use the ids we found as input to the working tree
-                # inventory iterator, so we only consider those ids rather than
-                # examining the whole tree again.
-                # XXX: Dont we have filter_unversioned to do this more
-                # cheaply?
-                self.specific_file_ids = tree.find_ids_across_trees(
-                    self.specific_files, [self.basis_tree, self.work_tree])
-            else:
-                self.specific_file_ids = None
+        # If provided, ensure the specified files are versioned
+        if self.specific_files is not None:
+            # Note: This routine is being called because it raises
+            # PathNotVersionedError as a side effect of finding the IDs. We
+            # later use the ids we found as input to the working tree
+            # inventory iterator, so we only consider those ids rather than
+            # examining the whole tree again.
+            # XXX: Dont we have filter_unversioned to do this more
+            # cheaply?
+            self.specific_file_ids = tree.find_ids_across_trees(
+                self.specific_files, [self.basis_tree, self.work_tree])
+        else:
+            self.specific_file_ids = None

***** This looks like a regression to me - it's duplicating work
iter_changes does. Why are you changing this?

=== modified file 'bzrlib/osutils.py'
--- bzrlib/osutils.py	2009-05-07 04:58:58 +0000
+++ bzrlib/osutils.py	2009-05-15 02:31:26 +0000
@@ -853,6 +853,16 @@
     return pathjoin(*p)
 
 
+def parent_directories(filename):
+    """Return the list of parent directories, deepest first."""
+    parents = []
+    parts = splitpath(dirname(filename))
+    while parts:
+        parents.append(joinpath(parts))
+        parts.pop()
+    return parents

***** This misses the root directory.
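
One way to read this comment is that the tree root (the empty path) should close the list. A minimal sketch of that variant follows, assuming '/'-separated relative paths; it uses `posixpath` in place of the bzrlib `splitpath`/`joinpath` helpers, and the `include_root` flag is hypothetical, added here only to show both behaviours side by side.

```python
import posixpath


def parent_directories(filename, include_root=False):
    """Return the parent directories of a relative path, deepest first.

    include_root is a hypothetical flag for this sketch; when set, the
    tree root ('') is appended as the last entry.
    """
    parents = []
    parts = [p for p in posixpath.dirname(filename).split('/') if p]
    while parts:
        parents.append('/'.join(parts))
        parts.pop()
    if include_root:
        parents.append('')
    return parents


print(parent_directories('a/b/c'))
print(parent_directories('a/b/c', include_root=True))
```

Whether the helper or its caller should be responsible for the root is a design choice; the filter in this patch seeds its parent set with the root itself.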

=== modified file 'bzrlib/tests/test_osutils.py'
--- bzrlib/tests/test_osutils.py	2009-05-07 05:08:46 +0000
+++ bzrlib/tests/test_osutils.py	2009-05-15 02:31:27 +0000
@@ -860,6 +860,18 @@
         self.assertRaises(errors.BzrError, osutils.splitpath, 'a/../b')
 
 
+class TestParentDirectories(tests.TestCaseInTempDir):
+    """Test osutils.parent_directories()"""
+
+    def test_parent_directories(self):
+        def check(expected, path):
+            self.assertEqual(expected, osutils.parent_directories(path))
+
+        check([], 'a')
+        check(['a'], 'a/b')
+        check(['a/b', 'a'], 'a/b/c')
+
+

***** Please don't do this; the helper in this case isn't worth the
complexity of having to figure out what 
-check(...)
+check(...)

will mean in a review in 6 months.
Either:
 - use a well named helper
 - just inline the calls
 - use a for loop
 - use test parameterisation
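
As an illustration of the "for loop" and "parameterisation" options, here is a sketch using only the standard-library `unittest` module. It is written against a local stand-in for `parent_directories`, not the real osutils function, and `subTest` needs Python 3.4 or later, so it is a present-day sketch and not something that could have gone into bzrlib as it stood.

```python
import unittest


def parent_directories(filename):
    """Local stand-in mirroring the helper in the patch above."""
    parts = [p for p in filename.split('/')[:-1] if p]
    parents = []
    while parts:
        parents.append('/'.join(parts))
        parts.pop()
    return parents


class TestParentDirectories(unittest.TestCase):
    """Each scenario is a (path, expected parents) pair."""

    SCENARIOS = [
        ('a', []),
        ('a/b', ['a']),
        ('a/b/c', ['a/b', 'a']),
    ]

    def test_parent_directories(self):
        for path, expected in self.SCENARIOS:
            # subTest reports the failing path instead of stopping at the first
            with self.subTest(path=path):
                self.assertEqual(expected, parent_directories(path))
```

Adding or changing a scenario then shows up in a diff as a changed data row, which is easier to review than a changed call to an anonymous `check` helper.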

=== modified file 'bzrlib/tests/test_tree.py'
--- bzrlib/tests/test_tree.py	2009-03-23 14:59:43 +0000
+++ bzrlib/tests/test_tree.py	2009-05-15 02:31:27 +0000
@@ -23,7 +23,7 @@
     tree as _mod_tree,
     )
 from bzrlib.tests import TestCaseWithTransport
-from bzrlib.tree import InterTree
+from bzrlib.tree import InterTree, filter_iter_changes_by_paths
 
 
 class TestInterTree(TestCaseWithTransport):
@@ -417,3 +417,84 @@
         self.assertPathToKey(([u''], u'a'), u'a')
         self.assertPathToKey(([u'a'], u'b'), u'a/b')
         self.assertPathToKey(([u'a', u'b'], u'c'), u'a/b/c')
+
+
+class TestFilterIterChangesByPaths(TestCaseWithTransport):
+    """Tests for tree.filter_iter_changes_by_paths()."""

***** Broadly, all the tests below would be much more debuggable as unit
tests rather than black box tests. The bit that is hard to know what it
means is 'original=self.make_changes_iter()'. I can't tell without
running pdb, or writing tests to test the test helper, whether it
actually is missing added roots, for instance.

I think the selection of test cases is too small. Please be sure to:
* test a selected file case where the root is new
* test an exclude case where the root is excluded
* test a selected + exclude case where the exclude is outside the
  selected
* test a selected + exclude case where the selected file has been moved
  into the excluded tree
  (e.g. old has [README, exclude/], new has [exclude/README],
  selected_files=['README'], excludes=['exclude'])

There are a number of other permutations that need testing that aren't
tested in your patch. If you need help determining them I'll write them
down, but in the interest of getting you unblocked I'm skipping that for
now. The code that this is layering on is pretty exhaustively tested,
and as we're replacing the use of well tested apis with this new
function we need to test it well too.

+def filter_iter_changes_by_paths(changes, include_files=None,
+    exclude_files=None, yield_changed_parents=False):
+    """Filter the results from iter_changes using paths.
+
+    This decorator is useful for post-processing the output from
+    Tree.iter_changes(). It may also be used for post-procesing the output
+    from result-compatible methods such as Inventory.iter_changes() and
+    PreviewTree.iter_changes(). Note that specific-path filtering should not
+    have already been applied.
+
+    :param changes: the iterator of changes to filter
+    :param include_files: paths of files and directories to include or None
+      for no masking.

**** Please use 'selected_files' as the parameter, for consistency, or
otherwise describe how people should go from a 'selected_files' set to a
'include_files' set. Are they different, the same, what?

Rather than 'no masking', perhaps 'to disable <action>'.

+    :param exclude_files: paths of files and directories to exclude or None
+      for no masking. Excludes take precedence over includes.
+    :param yield_changed_parents: if True, include changed parent directories
+      of included paths.
+      of included paths.

**** As mentioned above, I don't think we have a use case for
'yield-changed-parents=False'.

+    """
+    if not (include_files or exclude_files):
+        for change in changes:
+            yield change

**** This can be optimised out - sketch:
def foo(iterator, *args):
    if can_skip_me:
        return iterator
    else:
        return _foo(iterator, *args)
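
The sketch above can be made runnable as follows. The names `filter_changes`, `_filter_changes` and the `keep` predicate are hypothetical, chosen only for the illustration. The point is that a plain function can hand back the original iterator untouched, whereas a generator function always adds a layer of iteration.

```python
def _filter_changes(changes, keep):
    """Generator that does the real filtering work."""
    for change in changes:
        if keep(change):
            yield change


def filter_changes(changes, keep=None):
    """Return `changes` itself when there is nothing to filter."""
    if keep is None:
        return changes
    return _filter_changes(changes, keep)


source = iter([('a-id', 'a'), ('b-id', 'b')])
same = filter_changes(source)
filtered = list(filter_changes(iter([('a-id', 'a'), ('b-id', 'b')]),
                               keep=lambda change: change[1] == 'b'))
print(same is source, filtered)
```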

+    # Find the sets of files to include, additional parents and excludes,
+    # always including the root in the parents
+    includes_parents = set([''])
+    if include_files:
+        include_set = osutils.minimum_path_selection(include_files)
+        if yield_changed_parents:
+            for include in include_set:
+                for parent in osutils.parent_directories(include):
+                    if parent not in include_set:
+                        includes_parents.add(parent)
+    else:
+        include_set = set([''])

+    if exclude_files:
+        exclude_set = osutils.minimum_path_selection(exclude_files)
+    else:
+        exclude_set = set([])
+
+    for change in changes:
+        # Decide which path to use
+        old_path, new_path = change[1]
+        if new_path is None:
+            path = old_path
+        else:
+            path = new_path
+
+        # Do the filtering
+        if exclude_set and is_inside_any(exclude_set, path):
+            continue
+        elif is_inside_any(include_set, path):
+            yield change
+        elif yield_changed_parents and path in includes_parents:
+            yield change

**** Reading this, I'm not sure how you are getting the deltas for the
parents if they were not selected by iter changes.

Perhaps I'm missing something.

Anyhow, this is shaping up well I think.

(I wish I could say 'review resubmit')
 review disapprove

review: Disapprove
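
To make the parent-directory concern concrete, here is a simplified, self-contained model of the filter quoted above. It is a sketch under assumptions: is_inside, is_inside_any and parent_directories are hypothetical stand-ins written for this example (not the bzrlib.osutils implementations), minimum_path_selection is skipped, and each change is reduced to a (file_id, (old_path, new_path)) pair.

```python
def is_inside(directory, path):
    """True if path is directory itself or lies beneath it ('' is the root)."""
    if directory == path or directory == "":
        return True
    return path.startswith(directory.rstrip("/") + "/")


def is_inside_any(directories, path):
    return any(is_inside(d, path) for d in directories)


def parent_directories(path):
    """Parents of path, deepest first, excluding the root."""
    parts = path.split("/")[:-1]
    return ["/".join(parts[:i]) for i in range(len(parts), 0, -1)]


def filter_by_paths(changes, include=None, exclude=None,
                    yield_changed_parents=False):
    include_set = set(include) if include else {""}
    exclude_set = set(exclude) if exclude else set()
    parents = {""}
    if include and yield_changed_parents:
        for inc in include_set:
            parents.update(p for p in parent_directories(inc)
                           if p not in include_set)
    for change in changes:
        old_path, new_path = change[1]
        path = old_path if new_path is None else new_path
        if exclude_set and is_inside_any(exclude_set, path):
            continue
        if is_inside_any(include_set, path) or (
                yield_changed_parents and path in parents):
            yield change


# The whole tree was compared, so the changed parent 'bar' is in the stream.
full_tree = [("bar-id", (None, "bar")), ("bar2-id", (None, "bar/bar2")),
             ("baz-id", (None, "baz"))]
# Hypothetical stream already restricted to the selected path upstream.
prefiltered = [("bar2-id", (None, "bar/bar2"))]

from_full = [c[0] for c in filter_by_paths(
    full_tree, include=["bar/bar2"], yield_changed_parents=True)]
from_prefiltered = [c[0] for c in filter_by_paths(
    prefiltered, include=["bar/bar2"], yield_changed_parents=True)]
```

In this model the parent entry only comes out when the incoming stream already contains it, which is why a post-filter of this kind depends on iter_changes having examined the whole tree.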

Revision history for this message

Robert Collins (lifeless) wrote on 2009-05-15: Posted in a previous version of this proposal

#

> **** Reading this, I'm not sure how you are getting the deltas for the
> parents if they were not selected by iter changes.
>
> Perhaps I'm missing something.

(but I don't think I am - I've checked back against the previous diff
and this isn't a change from the refactoring; it was present in that
version as well.)

-Rob

Revision history for this message

Ian Clatworthy (ian-clatworthy) wrote on 2009-05-15: Posted in a previous version of this proposal

#

Follow-up IRC discussion re review and next steps ...

(14:46:07) igc: lifeless: so, I might be stupid but I still don't see the problem ...
(14:46:13) lifeless: ok
(14:46:24) igc: iter_changes only needs to return changed parents, not all of them
(14:46:46) lifeless: right, but unless its looking at parents it won't return them, because iter_changes has a bug
(14:46:52) igc: i.e. if a parent isn't in iter_changes, then it isn't needed in apply_delta IIUIC
(14:47:22) igc: lifeless: but it *is* looking at parent right now
(14:47:27) lifeless: how?
(14:47:33) igc: it looks at the whole tree, then post-filters
(14:47:59) lifeless: ow
(14:48:17) lifeless: I'm really torn
(14:48:17) igc: and osutils.parent_directories doesn't need to return the root
(14:48:31) igc: at the os level, '' doesn't make sense
(14:48:52) lifeless: well, other code would be cleaner if it does - as you are making sure you have [''] always
(14:49:05) igc: also, my change to when record_iter_changes was deliberate ...
(14:49:15) igc: and the comment still hold exactly as is btw
(14:49:26) igc: I benchmarked to decide when it was the right thing fwiw
(14:51:24) lifeless: why the (len(self.parents) < 2 and not self.specific_files and not self.exclude)
(14:51:27) lifeless: clause?
(14:51:46) igc: because for non-chk formats, ...
(14:52:21) igc: it's slower than the current code otherwise
(14:52:29) igc: at least on Emacs
(14:53:17) igc: I'm happy to extend the comment along those lines
(14:53:42) lifeless: So this means that the self.specific_files and self.exclude processing with record_iter_changes is _slower_ than the old full-tree code path
(14:54:02) lifeless: and that it's only faster with record_iter_changes on chk repositories because the chk code is able to compensate
(14:54:13) lifeless: I appreciate that you've got it going faster
(14:54:17) igc: yes, according to my measurements
(14:54:47) lifeless: I'm really worried that the other changes, such as the reinstatement of files_across_trees are going to make it harder for someone to come back and fix bug 347649 properly
(14:54:48) igc: it was consistently slower - 08. vs 04 or something like that
(14:54:48) ubottu: Launchpad bug 347649 in bzr "iter_changes missing support needed for commit" [High,Confirmed] https://launchpad.net/bugs/347649
(14:54:56) lifeless: as they will have to basically undo this patch to fix it
(14:55:28) lifeless: on the other hand I don't want to block performance improvements
(14:55:35) igc: lifeless: I did look at trying to remove the file_across_trees bit but ...
(14:55:44) igc: it broke the test suite cos ...
(14:55:45) lifeless: do you have any suggestion about how we can do both things?
(14:55:56) igc: we trap for bogus include paths
(14:56:06) igc: and iter_changes doesn't see those
(14:56:21) lifeless: file_across_trees is a problem because it upcasts the dirstate to an inventory - it's spectacularly slow
(14:56:44) lifeless: emacs is what 20K files?
(14:56:59) igc: just checking, we're talking about the _set_specific_file_ids() method yes?
(14:57:06) lifeless: yes
(14:57:25) igc: emacs is 8K from memory
(14:57:31) igc: 100K revisions but ...
(14:57:38) igc: that doesn't matter much here
(14:57:42) lifeless: ok, so 16K inventory entries
(14:58:19) lifeless: I don't feel good about this code
(14:58:57) lifeless: so I'm going to recuse my self as a review at this point; I don't want to block you, and I haven't managed to open your eyes to why the layering is a problem or matters so much
(14:59:25) lifeless: I'll be effectively offline for 2 weeks, and I don't like the idea of you being blocked for that long.
(14:59:34) igc: I see and agree with the desire for one iter_changes call covering ...
(14:59:50) igc: legal filename checking, strict checking and collecting results
(15:00:49) igc: but I don't see it being necessary to get there in a single step
(15:01:20) lifeless: I bet you cold cache specific file commits will be way slower with your patch
(15:01:39) lifeless: particularly on the large trees I've been working up to - 50K and 100K files
(15:02:00) igc: slower than now or slower than possible?
(15:02:08) lifeless: slower than now
(15:02:25) lifeless: I could be wrong
(15:03:57) lifeless: but what you trade off is some unoptimised code (CHKRepository.add_inventory) for some optimised code that does a lot of IO (WT.iter_changes of everything)
(15:04:49) igc: I'll throw together the OOo tree and benchmark on it
(15:05:04) lifeless: cold cache will be key to see the disk IO impact
(15:05:18) lifeless: I can suggest a completely different approach if you just want to make it faster in the interim
(15:05:22) igc: lifeless: design wise, I have a question ...
(15:05:39) igc: if we delegate specific_file processing to iter_changes ...
(15:05:40) lifeless: (an approach I wouldn't feel unhappy about)
(15:06:02) igc: how we we accurately insert the parents after the fact?
(15:06:07) igc: s/we/can/
(15:06:16) igc: we don't have all the info?
(15:06:21) lifeless: igc: iter_changes is allowed to jump around
(15:06:37) lifeless: igc: it already does this for renames
(15:07:05) lifeless: iter_changes has built into it the logic of find_ids_across_trees
(15:07:13) lifeless: anyway
(15:07:23) lifeless: if you want something easy
(15:07:34) lifeless: add a parameter to finish_inventory called 'basis_inventory'
(15:07:45) lifeless: and if use_record_iter_changes is False in commit
(15:07:55) lifeless: pass self.basis_inventory to finish_inventory
(15:08:05) lifeless: and in finish_inventory do:
(15:08:44) lifeless: if self.repository._format._commit_inv_deltas
(15:08:57) lifeless: delta = self.new_inventory._make_delta(basis_inventory)
(15:09:10) lifeless: self.repository.add_inventory_by_delta(basis_inventory, basis_inventory.revision_id)
(15:09:14) lifeless: ^
(15:09:45) lifeless: this will work around finish_inventory being overly slow in CHK repositories when not using record_iter_changes
(15:10:17) lifeless: and as you'll already have a basis_inventory it should be nigh free
(15:11:23) igc: lifeless: ok, I'll play with that next week and try it on some larger data sets
(15:12:32) lifeless: igc: I would expect that to be ~= to what you have today, and deal with cold cache and other situations much better
(15:12:42) lifeless: no where near as fast as fixing iter_changes
(15:12:57) lifeless: I would expect a fixed iter_changes to smoke everything
(15:14:06) lifeless: as for what to do in iter_changes to handle parent dirs
(15:14:19) lifeless: I would keep a minimal set of directory items output
(15:14:31) lifeless: and a 'we need a parent of X' queue
(15:14:47) lifeless: at the end of the normal loop, if there 'we need a parent of X' is non-empty
(15:15:13) lifeless: we do the normal loop on just the missing parents
(15:15:17) lifeless: without recursion
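
The parent-queue idea at the end of the log can be sketched roughly as follows. This is a toy model under stated assumptions, not how any bzrlib iter_changes is written: `changed` is a hypothetical dict mapping a path to its change record (standing in for the real tree comparison), and is_inside and parent_directories are stand-in helpers defined here.

```python
def is_inside(directory, path):
    """True if path is directory itself or lies beneath it ('' is the root)."""
    if directory == path or directory == "":
        return True
    return path.startswith(directory.rstrip("/") + "/")


def parent_directories(path):
    """Parents of path, deepest first, excluding the root."""
    parts = path.split("/")[:-1]
    return ["/".join(parts[:i]) for i in range(len(parts), 0, -1)]


def iter_changes_with_parents(changed, selected_paths):
    emitted = set()          # paths already output
    needed_parents = []      # the 'we need a parent of X' queue

    # Normal loop: only paths inside the selection are examined.
    for path in sorted(changed):
        if not any(is_inside(sel, path) for sel in selected_paths):
            continue
        emitted.add(path)
        yield changed[path]
        needed_parents.extend(parent_directories(path))

    # Follow-up pass over just the missing parents, without recursing
    # into their other children.
    for parent in needed_parents:
        if parent in emitted:
            continue
        emitted.add(parent)
        if parent in changed:
            yield changed[parent]


changed = {"foo": "foo-change", "foo/foo1": "foo1-change",
           "baz": "baz-change"}
result = list(iter_changes_with_parents(changed, ["foo/foo1"]))
```

The sketch emits parents after their children and leaves the root out of parent_directories; whether a consumer such as record_iter_changes needs a particular ordering or the root entry is not addressed here.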

Revision history for this message

Ian Clatworthy (ian-clatworthy) wrote on 2009-06-25:

#

This is effectively a resubmission of my earlier patch 6 weeks ago, with some minor clean-ups. While I agree that we can do much better than this by pushing more functionality down into iter_changes, I still believe this is a safe and valuable step forward. For example, usertest results (run tonight) on the 'selective commit' task comparing bzr.dev r4476 to r4476+patch show an improvement from 73.4 seconds to 3.0 seconds for OOo on format 2a. In other words, this patch improves performance from "absolutely terrible on large projects" to "acceptable for most users".

FWIW, I did spend quite a bit of time over two weeks trying a different approach, namely smarter use of iter_changes - doing include filtering there - plus decorators over those results. That proved to be a dead-end because of safety concerns: race-conditions around dirstate locking mean wrong results are possible. In comparison, this patch is safe as best I can tell.

I'll leave it to others to decide whether this is enough progress on this problem for us to ship 2.0. At a minimum, I think it changes the bug priority from Critical to something less.

Revision history for this message

Robert Collins (lifeless) wrote on 2009-06-25:

#

Do you have cold cache (hot bzr, cold tree) performance data for it?

I ask because from my reading of the patch, it could well be worse for
ooo in that [not uncommon] scenario.

-Rob

Revision history for this message

Robert Collins (lifeless) wrote on 2009-07-02:

#

This can generate bad commit deltas too; it's the primary thing that is making my doing the changes at the right level problematic as well.

While I don't object in principle to layering like this if it is faster, correctness in the commit code path has to be a high priority.

I have a fix-in-iter-changes working, I'm currently focused on examining all the cases that determine whether the code can produce bad deltas that will be accepted silently, corrupting repositories.

review: Needs Fixing

Revision history for this message

Ian Clatworthy (ian-clatworthy) wrote on 2009-08-19:

#

This has been superseded by Robert's work so I'll change the status to Rejected.

Bazaar

Merge lp:~bzr/bzr/faster-commit-file into lp:~bzr/bzr/trunk-old

Preview Diff

 === modified file 'NEWS'
 --- NEWS	2009-08-18 20:05:30 +0000
 +++ NEWS	2009-08-19 02:39:06 +0000
@@ -595,6 +595,7 @@
    or ``bzr+ssh://`` is now much faster and involves no VFS operations.
    This speeds up commands like ``bzr pull -r 123``.  (Andrew Bennetts)
++<<<<<<< TREE
  * ``revision-info`` now properly aligns the revnos/revids in the output
    and doesn't traceback when given revisions not in the current branch.
    Performance is also significantly improved when requesting multiple revs
@@ -603,6 +604,11 @@
  * Tildes are no longer escaped by Transports. (Andy Kilner)
++=======
++* Selective commit performance on ``2a`` is now better than it is on
++  ``1.9`` format. (Ian Clatworthy)
++
++>>>>>>> MERGE-SOURCE
  Documentation
  *************
 === modified file 'bzrlib/commit.py'
 --- bzrlib/commit.py	2009-07-15 05:54:37 +0000
 +++ bzrlib/commit.py	2009-08-19 02:39:06 +0000
@@ -284,12 +284,14 @@
          # We can use record_iter_changes IFF iter_changes is compatible with
          # the command line parameters, and the repository has fast delta
          # generation. See bug 347649.
--        self.use_record_iter_changes = (
--            not self.specific_files and
--            not self.exclude and
--            not self.branch.repository._format.supports_tree_reference and
--            (self.branch.repository._format.fast_deltas or
--             len(self.parents) < 2))
++        if self.branch.repository._format.supports_tree_reference:
++            # TODO: fix this
++            self.use_record_iter_changes = False
++        else:
++            self.use_record_iter_changes = \
++                self.branch.repository._format.fast_deltas
++                #or (len(self.parents) < 2 and not self.specific_files
++                #and not self.exclude))
          self.pb = bzrlib.ui.ui_factory.nested_progress_bar()
          self.basis_revid = self.work_tree.last_revision()
          self.basis_tree = self.work_tree.basis_tree()
@@ -580,10 +582,10 @@
              except Exception, e:
                  found_exception = e
          if found_exception is not None:
--            # don't do a plan raise, because the last exception may have been
++            # don't do a plain raise, because the last exception may have been
              # trashed, e is our sure-to-work exception even though it loses the
              # full traceback. XXX: RBC 20060421 perhaps we could check the
--            # exc_info and if its the same one do a plain raise otherwise
++            # exc_info and if it's the same one do a plain raise otherwise
              # 'raise e' as we do now.
              raise e
@@ -618,13 +620,16 @@
      def _update_builder_with_changes(self):
          """Update the commit builder with the data about what has changed.
          """
--        exclude = self.exclude
--        specific_files = self.specific_files or []
--        mutter("Selecting files for commit with filter %s", specific_files)
++        mutter("Selecting files for commit with filter %s excluding %s",
++            self.specific_files, self.exclude)
          self._check_strict()
          if self.use_record_iter_changes:
              iter_changes = self.work_tree.iter_changes(self.basis_tree)
++            if self.specific_files or self.exclude:
++                iter_changes = tree._filter_iter_changes_by_paths(iter_changes,
++                    self.specific_files, self.exclude,
++                    yield_changed_parents=True)
              iter_changes = self._filter_iter_changes(iter_changes)
              for file_id, path, fs_hash in self.builder.record_iter_changes(
                  self.work_tree, self.basis_revid, iter_changes):
@@ -640,7 +645,7 @@
          This method reports on the changes in iter_changes to the user, and
          converts 'missing' entries in the iter_changes iterator to 'deleted'
--        entries. 'missing' entries have their
++        entries.
          :param iter_changes: An iter_changes to process.
          :return: A generator of changes.
@@ -652,7 +657,6 @@
              if report_changes:
                  old_path = change[1][0]
                  new_path = change[1][1]
--                versioned = change[3][1]
              kind = change[6][1]
              versioned = change[3][1]
              if kind is None and versioned:
@@ -941,17 +945,16 @@
      def _set_specific_file_ids(self):
          """populate self.specific_file_ids if we will use it."""
--        if not self.use_record_iter_changes:
--            # If provided, ensure the specified files are versioned
--            if self.specific_files is not None:
--                # Note: This routine is being called because it raises
--                # PathNotVersionedError as a side effect of finding the IDs. We
--                # later use the ids we found as input to the working tree
--                # inventory iterator, so we only consider those ids rather than
--                # examining the whole tree again.
--                # XXX: Dont we have filter_unversioned to do this more
--                # cheaply?
--                self.specific_file_ids = tree.find_ids_across_trees(
--                    self.specific_files, [self.basis_tree, self.work_tree])
--            else:
--                self.specific_file_ids = None
++        # If provided, ensure the specified files are versioned
++        if self.specific_files is not None:
++            # Note: This routine is being called because it raises
++            # PathNotVersionedError as a side effect of finding the IDs. We
++            # later use the ids we found as input to the working tree
++            # inventory iterator, so we only consider those ids rather than
++            # examining the whole tree again.
++            # XXX: Dont we have filter_unversioned to do this more
++            # cheaply?
++            self.specific_file_ids = tree.find_ids_across_trees(
++                self.specific_files, [self.basis_tree, self.work_tree])
++        else:
++            self.specific_file_ids = None
 === modified file 'bzrlib/repository.py'
 --- bzrlib/repository.py	2009-08-17 23:15:55 +0000
 +++ bzrlib/repository.py	2009-08-19 02:39:06 +0000
@@ -558,7 +558,7 @@
          :param iter_changes: An iter_changes iterator with the changes to apply
              to basis_revision_id. The iterator must not include any items with
              a current kind of None - missing items must be either filtered out
--            or errored-on beefore record_iter_changes sees the item.
++            or errored-on before record_iter_changes sees the item.
          :param _entry_factory: Private method to bind entry_factory locally for
              performance.
          :return: A generator of (file_id, relpath, fs_hash) tuples for use with
 === modified file 'bzrlib/tests/test_tree.py'
 --- bzrlib/tests/test_tree.py	2009-03-23 14:59:43 +0000
 +++ bzrlib/tests/test_tree.py	2009-08-19 02:39:06 +0000
@@ -23,7 +23,7 @@
      tree as _mod_tree,
+     )
  from bzrlib.tests import TestCaseWithTransport
--from bzrlib.tree import InterTree
++from bzrlib.tree import InterTree, _filter_iter_changes_by_paths
  class TestInterTree(TestCaseWithTransport):
@@ -417,3 +417,84 @@
          self.assertPathToKey(([u''], u'a'), u'a')
          self.assertPathToKey(([u'a'], u'b'), u'a/b')
          self.assertPathToKey(([u'a', u'b'], u'c'), u'a/b/c')
++
++
++class TestFilterIterChangesByPaths(TestCaseWithTransport):
++    """Tests for tree._filter_iter_changes_by_paths()."""
++
++    def make_changes_iter(self):
++        wt = self.make_branch_and_tree('.')
++        b = wt.branch
++        self.build_tree(['foo/', 'foo/foo1', 'bar/', 'bar/bar1', 'bar/bar2',
++            'baz'])
++        wt.add(['foo', 'foo/foo1', 'bar', 'bar/bar1', 'bar/bar2', 'baz'],
++              ['foo-id', 'foo1-id', 'bar-id', 'bar1-id', 'bar2-id', 'baz-id'])
++        wt.commit('bar/bar1', specific_files=['bar/bar1'], rev_id='1')
++        basis = wt.basis_tree()
++        wt.lock_read()
++        self.addCleanup(wt.unlock)
++        basis.lock_read()
++        self.addCleanup(basis.unlock)
++        return wt.iter_changes(basis)
++
++    def test_no_filtering(self):
++        original = self.make_changes_iter()
++        original_list = list(original)
++        filtered = _filter_iter_changes_by_paths(iter(original_list))
++        self.assertEqual(original_list, list(filtered))
++
++    def test_include_files(self):
++        original = self.make_changes_iter()
++        filtered = _filter_iter_changes_by_paths(original,
++            include_files=['bar'])
++        self.assertEqual([
++            ('bar2-id', (None, u'bar/bar2'), True, (False, True),
++            (None, 'bar-id'), (None, u'bar2'), (None, 'file'), (None, 0)),
++            ], list(filtered))
++
++    def test_include_files_yielding_changed_parents(self):
++        original = self.make_changes_iter()
++        filtered = _filter_iter_changes_by_paths(original,
++            include_files=['bar'], yield_changed_parents=True)
++        self.assertEqual([
++            ('bar2-id', (None, u'bar/bar2'), True, (False, True),
++            (None, 'bar-id'), (None, u'bar2'), (None, 'file'), (None, 0)),
++            ], list(filtered))
++
++    def test_include_files_yielding_changed_parents2(self):
++        original = self.make_changes_iter()
++        filtered = _filter_iter_changes_by_paths(original,
++            include_files=['foo'], yield_changed_parents=True)
++        self.assertEqual([
++            ('foo-id', (None, u'foo'), True, (False, True),
++            (None, 'TREE_ROOT'), (None, u'foo'), (None, 'directory'), (None, 0)),
++            ('foo1-id', (None, u'foo/foo1'), True, (False, True),
++            (None, 'foo-id'), (None, u'foo1'), (None, 'file'), (None, 0)),
++            ], list(filtered))
++
++    def test_exclude_files(self):
++        original = self.make_changes_iter()
++        filtered = _filter_iter_changes_by_paths(original,
++            exclude_files=['foo'])
++        self.assertEqual([
++            ('baz-id', (None, u'baz'), True, (False, True),
++            (None, 'TREE_ROOT'), (None, u'baz'), (None, 'file'), (None, 0)),
++            ('bar2-id', (None, u'bar/bar2'), True, (False, True),
++            (None, 'bar-id'), (None, u'bar2'), (None, 'file'), (None, 0)),
++            ], list(filtered))
++        pass
++
++    def test_excludes_override_includes(self):
++        original = self.make_changes_iter()
++        filtered = _filter_iter_changes_by_paths(original,
++            include_files=['bar/bar2'], exclude_files=['bar'])
++        self.assertEqual([], list(filtered))
++
++    def test_excludes_override_includes2(self):
++        original = self.make_changes_iter()
++        filtered = _filter_iter_changes_by_paths(original,
++            include_files=['foo'], exclude_files=['foo/foo1'])
++        self.assertEqual([
++            ('foo-id', (None, u'foo'), True, (False, True),
++            (None, 'TREE_ROOT'), (None, u'foo'), (None, 'directory'), (None, 0)),
++            ], list(filtered))
 === modified file 'bzrlib/tree.py'
 --- bzrlib/tree.py	2009-07-17 06:04:35 +0000
 +++ bzrlib/tree.py	2009-08-19 02:39:06 +0000
@@ -35,7 +35,7 @@
  from bzrlib import errors
  from bzrlib.inventory import InventoryFile
  from bzrlib.inter import InterObject
--from bzrlib.osutils import fingerprint_file
++from bzrlib.osutils import fingerprint_file, is_inside_any
  import bzrlib.revision
  from bzrlib.symbol_versioning import deprecated_function, deprecated_in
  from bzrlib.trace import note
@@ -1289,3 +1289,59 @@
                      other_values.append(self._lookup_by_file_id(
                                              alt_extra, alt_tree, file_id))
                  yield other_path, file_id, None, other_values
++
++
++def _filter_iter_changes_by_paths(changes, include_files=None,
++    exclude_files=None, yield_changed_parents=False):
++    """Filter the results from iter_changes using paths.
++
++    This decorator is useful for post-processing the output from
++    Tree.iter_changes(). It may also be used for post-procesing the output
++    from result-compatible methods such as Inventory.iter_changes() and
++    PreviewTree.iter_changes(). Note that specific-path filtering should not
++    have already been applied.
++
++    :param changes: the iterator of changes to filter
++    :param include_files: paths of files and directories to include or None
++      for no masking.
++    :param exclude_files: paths of files and directories to exclude or None
++      for no masking. Excludes take precedence over includes.
++    :param yield_changed_parents: if True, include changed parent directories
++      of included paths.
++    """
++    if not (include_files or exclude_files):
++        for change in changes:
++            yield change
++
++    # Find the sets of files to include, additional parents and excludes,
++    # always including the root in the parents
++    includes_parents = set([''])
++    if include_files:
++        include_set = osutils.minimum_path_selection(include_files)
++        if yield_changed_parents:
++            for include in include_set:
++                for parent in osutils.parent_directories(include):
++                    if parent not in include_set:
++                        includes_parents.add(parent)
++    else:
++        include_set = set([''])
++    if exclude_files:
++        exclude_set = osutils.minimum_path_selection(exclude_files)
++    else:
++        exclude_set = set([])
++
++    for change in changes:
++        # Decide which path to use
++        old_path, new_path = change[1]
++        if new_path is None:
++            path = old_path
++        else:
++            path = new_path
++
++        # Do the filtering
++        if exclude_set and is_inside_any(exclude_set, path):
++            continue
++        elif is_inside_any(include_set, path):
++            yield change
++        elif yield_changed_parents and path in includes_parents:
++            yield change
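
A note on the unfiltered fast path in the function above: as written it has no return after its loop, so execution continues into the filtering loop. The following simplified model (hypothetical function names, with plain values standing in for change tuples) suggests what that could mean for a caller that passes a re-iterable sequence rather than a one-shot iterator.

```python
def passthrough_without_return(changes, include=None):
    if not include:
        for change in changes:
            yield change
        # No return here, so execution carries on into the filtering loop.
    include_set = set(include) if include else None
    for change in changes:
        if include_set is None or change in include_set:
            yield change


def passthrough_with_return(changes, include=None):
    if not include:
        for change in changes:
            yield change
        return  # stop once the unfiltered stream has been passed through
    include_set = set(include)
    for change in changes:
        if change in include_set:
            yield change


from_list = list(passthrough_without_return([1, 2]))            # list input
from_iterator = list(passthrough_without_return(iter([1, 2])))  # one-shot input
fixed = list(passthrough_with_return([1, 2]))
```

With a one-shot iterator the second loop finds nothing left, which would explain why a test that wraps its input in iter() passes; with a list every change comes out twice in this model.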