Merge lp:~bzr/bzr/faster-commit-file into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Rejected
Rejected by: Ian Clatworthy
Proposed branch: lp:~bzr/bzr/faster-commit-file
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 313 lines (has conflicts)
Text conflict in NEWS
To merge this branch: bzr merge lp:~bzr/bzr/faster-commit-file
Reviewer Review Type Date Requested Status
Robert Collins (community) Needs Fixing
Review via email: mp+7885@code.launchpad.net

This proposal supersedes a proposal from 2009-05-12.

To post a comment you must log in.
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

This branch makes commit of selected files on development6-rich-root faster than it is on 1.9 format, instead of being twice as slow. All tests still pass and the code changes are pretty simple so I think it is safe for 1.15rc.

It's possible to get further improvements here by pushing more complexity down into iter_changes as bug #347649 suggests. Given selective commit performance is down to 0.27 seconds for Emacs with 3K files and 105K revisions, I'm not sure that's necessary yet. That path is certainly a lot more work than this patch, given the multiple implementations of iter_changes() around the place and the tuning that has gone into them to date.

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

On Tue, 2009-05-12 at 14:05 +0000, Ian Clatworthy wrote:
> Ian Clatworthy has proposed merging lp:~bzr/bzr/faster-commit-file
> into lp:bzr.
>
> Requested reviews:
> Bazaar Developers (bzr)
>
> This branch makes commit of selected files on development6-rich-root
> faster than it is on 1.9 format, instead of being twice as slow. All
> tests still pass and the code changes are pretty simple so I think it
> is safe for 1.15rc.
>
> It's possible to get further improvements here by pushing more
> complexity down into iter_changes as bug #347649 suggests. Given
> selective commit performance is down to 0.27 seconds for Emacs with 3K
> files and 105K revisions, I'm not sure that's necessary yet. That path
> is certainly a lot more work than this patch, given the multiple
> implementations of iter_changes() around the place and the tuning that
> has gone into them to date.

This feels like a bit of a bandaid, and I don't think that the lack of
failures is an indication of correctness - there weren't tests for the
failures that bug 347649 refers to. For instance, simple inspection of
the patch suggests to me that you can miss the root id - e.g. a
pathological case would be a selective first commit with the root
skipped.

I appreciate not wanting to monkey with iter_changes, but the problem
with doing this just up at the commit level is that you have to black
box test. I don't think its safe to put these particular changes into
commit without specific tests.

If you were to put the masking support into iter-changes as a optional
decorating generator (*)you could using test iter-changes quite easily
(using the intertree tests) - if its fast enough for commit, then as you
say its likely fast enough in general, and by preserving the layering we
can optimise later. And unit tests for iter_changes will actually speak
to the core issue rather than circumstance, which is the most a test on
commit itself could do.

e.g.
def iter_changes_exclude_paths(changes, paths)
    for change in changes:
        if change[0][1] is_inside_any(paths):
            continue
    yield change

def iter_changes_include_added_parents(changes, specific_paths):
   ....

def iter_changes(....)
    result = self._iter_changes(...)
    if specific:
        result = iter_changes_include_added_parents(result, foo)
    if exclude:
        result = iter_changes_exclude_paths(result, foo)
    return result

 review disapprove

-Rob

review: Disapprove
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal
Download full text (3.6 KiB)

Robert Collins wrote:
> Review: Disapprove
> On Tue, 2009-05-12 at 14:05 +0000, Ian Clatworthy wrote:
>
>> t's possible to get further improvements here by pushing more
>> complexity down into iter_changes as bug #347649 suggests. Given
>> selective commit performance is down to 0.27 seconds for Emacs with 3K
>> files and 105K revisions, I'm not sure that's necessary yet. That path
>> is certainly a lot more work than this patch, given the multiple
>> implementations of iter_changes() around the place and the tuning that
>> has gone into them to date.
>>
>
> This feels like a bit of a bandaid, and I don't think that the lack of
> failures is an indication of correctness - there weren't tests for the
> failures that bug 347649 refers to. For instance, simple inspection of
> the patch suggests to me that you can miss the root id - e.g. a
> pathological case would be a selective first commit with the root
> skipped.
>
>
I'd expect bzr commit -x '' to be a pointless commit. I suspect my code
will error in exactly that way. It's a *very* pathological case as well:
no sane person would do that on the command line (though it may happen
via the API).
> I appreciate not wanting to monkey with iter_changes, but the problem
> with doing this just up at the commit level is that you have to black
> box test. I don't think its safe to put these particular changes into
> commit without specific tests.
>
>
"bzr selftest commit" runs 2335 tests so I think it's test coverage is
pretty comprehensive. Lots of tests broke during the course of putting
this patch together fwiw, e.g. simply passing specific_files to
iter_changes breaks the test suite so bug 347649 is indeed trapped by tests.

> If you were to put the masking support into iter-changes as a optional
> decorating generator (*)you could using test iter-changes quite easily
> (using the intertree tests) - if its fast enough for commit, then as you
> say its likely fast enough in general, and by preserving the layering we
> can optimise later. And unit tests for iter_changes will actually speak
> to the core issue rather than circumstance, which is the most a test on
> commit itself could do.
>
> e.g.
> def iter_changes_exclude_paths(changes, paths)
> for change in changes:
> if change[0][1] is_inside_any(paths):
> continue
> yield change
>
> def iter_changes_include_added_parents(changes, specific_paths):
> ....
>
>
> def iter_changes(....)
> result = self._iter_changes(...)
> if specific:
> result = iter_changes_include_added_parents(result, foo)
> if exclude:
> result = iter_changes_exclude_paths(result, foo)
> return result
>
>
That still implies adding 2 new options to *multiple* implementations of
iter_changes: yield_parents and exclude_files, say. That's a far bigger
change than this for no additional performance gain. Longer term, I'd be
surprised if pushing exclude_files down could gain us much. Yielding
parents in-situ could be *slightly* faster though but it would take a
large delta before any speed difference was noticed I suspect.
> review disapprove
This patch is an interim step - we really want to leverage the
...

Read more...

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

On Wed, 2009-05-13 at 03:33 +0000, Ian Clatworthy wrote:
>
> I'd expect bzr commit -x '' to be a pointless commit. I suspect my
> code
> will error in exactly that way. It's a *very* pathological case as
> well:
> no sane person would do that on the command line (though it may happen
> via the API).

Thats not what I mean. What I mean is
bzr init
bzr add *
bzr commit -m 'foo' README
-> bad data.

> > I appreciate not wanting to monkey with iter_changes, but the
> problem
> > with doing this just up at the commit level is that you have to
> black
> > box test. I don't think its safe to put these particular changes
> into
> > commit without specific tests.
> >
> >
> "bzr selftest commit" runs 2335 tests so I think it's test coverage is
> pretty comprehensive. Lots of tests broke during the course of putting
> this patch together fwiw, e.g. simply passing specific_files to
> iter_changes breaks the test suite so bug 347649 is indeed trapped by
> tests.

That count includes all the micro tests that test specific behaviours of
commit builder. I'm glad that we have accidental coverage of the bug;
I'm still very sure we don't have explicit coverage.

> That still implies adding 2 new options to *multiple* implementations
> of
> iter_changes: yield_parents and exclude_files, say. That's a far
> bigger
> change than this for no additional performance gain. Longer term, I'd
> be
> surprised if pushing exclude_files down could gain us much.

We have the concept that 'diff' and 'status' show us what 'commit' will
do. Until we have commit building on the same logic we'll struggle to
make that a reality. That is another thing that pushing the logic down
will help with.

> Yielding
> parents in-situ could be *slightly* faster though but it would take a
> large delta before any speed difference was noticed I suspect.
> > review disapprove

> This patch is an interim step - we really want to leverage the
> specific_files masking at the layer below so someone running 'bzr
> commit
> foo/bar' on the whole Debian tree gets a fast response that this patch
> delivers. Having said that, I don't think this patch is a band-aid and
> it meets all our criteria for approval: code quality is good, test
> coverage is not reduced and performance is improved by a factor of
> 4(!!)
> on a common operation. I think it ought to be landed as a step forward
> while I work on a faster solution. But that's just me.

Another factor now I think about this is people using
record_iter_changes. The caveats for using it are not really clear. And
having code in commit.py to workaround limits in iter_changes won't help
with that.

I can be persuaded that we should land this as is, but:
 - it definitely does not fix the bug report, because the bug is about
   iter_changes
 - I think its only a couple of hours work to do it right, within the
   constraints of 'get the interface solid and tested'. I don't see any
   benefit to not doing the work now, while its paged in and in your
mind.

-Rob

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

Resubmitted along the lines discussed on IRC.

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal
Download full text (10.1 KiB)

I've taken a diff against bzr.dev - review follows. Look for ****
introducing my comments

=== modified file 'bzrlib/commit.py'
--- bzrlib/commit.py 2009-04-03 00:07:49 +0000
+++ bzrlib/commit.py 2009-05-15 02:31:26 +0000
@@ -289,12 +289,14 @@
         # We can use record_iter_changes IFF iter_changes is compatible
with
         # the command line parameters, and the repository has fast
delta
         # generation. See bug 347649.
- self.use_record_iter_changes = (
- not self.specific_files and
- not self.exclude and
- not self.branch.repository._format.supports_tree_reference
and
- (self.branch.repository._format.fast_deltas or
- len(self.parents) < 2))
+ if self.branch.repository._format.supports_tree_reference:
+ # TODO: fix this
+ self.use_record_iter_changes = False
+ else:
+ self.use_record_iter_changes = (
+ self.branch.repository._format.fast_deltas
+ or (len(self.parents) < 2 and not self.specific_files
+ and not self.exclude))

**** You've altered this if block substantially, and I think
unintentionally. Please just use the old phrase and remove the two lines
that you're address. *OR* Update the comment to match what your new
if:else: actually does.

@@ -632,13 +634,16 @@
     def _update_builder_with_changes(self):
         """Update the commit builder with the data about what has
changed.
         """
- exclude = self.exclude
- specific_files = self.specific_files or []
- mutter("Selecting files for commit with filter %s",
specific_files)
+ mutter("Selecting files for commit with filter %s excluding %
s",
+ self.specific_files, self.exclude)

         self._check_strict()
         if self.use_record_iter_changes:
             iter_changes = self.work_tree.iter_changes(self.basis_tree)
+ if self.specific_files or self.exclude:
+ iter_changes =
tree.filter_iter_changes_by_paths(iter_changes,
+ self.specific_files, self.exclude,
+ yield_changed_parents=True)

**** I don't think a flag for 'yield_changed_parents' is needed; it
smells of YAGNI to me: its a new interface, and our only user wants
yield-changed-parents always on.

@@ -955,17 +959,16 @@

     def _set_specific_file_ids(self):
         """populate self.specific_file_ids if we will use it."""
- if not self.use_record_iter_changes:
- # If provided, ensure the specified files are versioned
- if self.specific_files is not None:
- # Note: This routine is being called because it raises
- # PathNotVersionedError as a side effect of finding the
IDs. We
- # later use the ids we found as input to the working
tree
- # inventory iterator, so we only consider those ids
rather than
- # examining the whole tree again.
- # XXX: Dont we have filter_unversioned to do this more
- # cheaply?
- self.specific_file_ids = tree.find_ids_across_trees(
- self.specific_files,...

review: Disapprove
Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

> **** Reading this, I'm not sure how you are getting the deltas for the
> parents if they were not selected by iter changes.
>
> Perhaps I"m missing something

(but I don't think I am - I've checked back against the previous diff
and this isn't a change from the refactoring; it was present in that
version as well.)

-Rob

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal
Download full text (6.7 KiB)

Follow-up IRC discussion re review and next steps ...

(14:46:07) igc: lifeless: so, I might be stupid but I still don't see the problem ...
(14:46:13) lifeless: ok
(14:46:24) igc: iter_changes only needs to return changed parents, not all of them
(14:46:46) lifeless: right, but unless its looking at parents it won't return them, because iter_changes has a bug
(14:46:52) igc: i.e. if a parent isn't in iter_changes, then it isn't needed in apply_delta IIUIC
(14:47:22) igc: lifeless: but it *is* looking at parent right now
(14:47:27) lifeless: how?
(14:47:33) igc: it looks at the whole tree, then post-filters
(14:47:59) lifeless: ow
(14:48:17) lifeless: I'm really torn
(14:48:17) igc: and osutils.parent_directories doesn't need to return the root
(14:48:31) igc: at the os level, '' doesn't make sense
(14:48:52) lifeless: well, other code would be cleaner if it does - as you are making sure you have [''] always
(14:49:05) igc: also, my change to when record_iter_changes was deliberate ...
(14:49:15) igc: and the comment still hold exactly as is btw
(14:49:26) igc: I benchmarked to decide when it was the right thing fwiw
(14:51:24) lifeless: why the (len(self.parents) < 2 and not self.specific_files and not self.exclude)
(14:51:27) lifeless: clause?
(14:51:46) igc: because for non-chk formats, ...
(14:52:21) igc: it's slower than the current code otherwise
(14:52:29) igc: at least on Emacs
(14:53:17) igc: I'm happy to extend the comment along those lines
(14:53:42) lifeless: So this means that the self.specific_files and self.exclude processing with record_iter_changes is _slower_ than the old full-tree code path
(14:54:02) lifeless: and that its only faster with record_iter_changes on chk repositoreis because the chk code is able to compensate
(14:54:13) lifeless: I appreciate that you've got it going faster
(14:54:17) igc: yes, according to my measurements
(14:54:47) lifeless: I'm really worried that the other changes, such as the reinstance of files_across_trees are going to make it harder for someone to come bacj and fix bug 347649 properly
(14:54:48) igc: it was consistently slower - 08. vs 04 or something like that
(14:54:48) ubottu: Launchpad bug 347649 in bzr "iter_changes missing support needed for commit" [High,Confirmed] https://launchpad.net/bugs/347649
(14:54:56) lifeless: as they will have to basically undo this patch to fix it
(14:55:28) lifeless: on the other hand I don't want to block performance improvements
(14:55:35) igc: lifeless: I did look at trying to remove the file_across_trees bit but ...
(14:55:44) igc: it broke the test suite cos ...
(14:55:45) lifeless: do you have any suggestion about how we can do both things?
(14:55:56) igc: we trap for bogus include paths
(14:56:06) igc: and iter_changes doesn't see those
(14:56:21) lifeless: file_across_trees is a problem because it upcassts the dirstate to an inventory - its spectcularly slow
(14:56:44) lifeless: emacs is what 20K files?
(14:56:59) igc: just checking, we're talking about the _set_specific_file_ids() method yes?
(14:57:06) lifeless: yes
(14:57:25) igc: emacs is 8K from memory
(14:57:31) igc: 100K revisions but ...
(14:57:38) igc: that doesn't matter myuch here
(...

Read more...

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

This is effectively a resubmission of my earlier patch 6 weeks ago, with some minor clean-ups. While I agree that we can do much better than this by pushing more functionality down into iter_changes, I still believe this is a safe and valuable step forward. For example, usertest results (run tonight) on the 'selective commit' task comparing bzr.dev r4476 to r4476+patch shows an improvement from 73.4 seconds to 3.0 seconds for OOo on format 2a. In other words, this patch improves performance from "absolutely terrible on large projects" to "acceptable for most users".

FWIW, I did spent quite a bit of time over two weeks trying a different approach, namely smarter use of iter_changes - doing include filtering there - plus decorators over those results. That proved to be a dead-end because of safety concerns: race-conditions around dirstate locking mean wrong results are possible. In comparison, this patch is safe as best I can tell.

I'll leave it to others to decide whether this is enough progress on this problem for us to ship 2.0. At a minimum, I think it changes the bug priority from Critical to something less.

Revision history for this message
Robert Collins (lifeless) wrote :

Do you have cold cache (hot bzr, cold tree) performance data for it?

I ask because from my reading of the patch, it could well be worse for
ooo in that [not uncommon] scenario.

-Rob

Revision history for this message
Robert Collins (lifeless) wrote :

This can generate bad commit deltas too; its the primary thing that is making my doing the changes at the right level problematic as well.

While I don't object in principle to layering like this if it is faster, correctness in the commit code path has to be a high priority.

I have a fix-in-iter-changes working, I'm currently focused on examining all the cases that determine whether the code can product bad deltas that will be accepted silently, corrupting repositories.

review: Needs Fixing
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

This has been superceded by Robert's work so I'll change the status to Rejected.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2009-08-18 20:05:30 +0000
3+++ NEWS 2009-08-19 02:39:06 +0000
4@@ -595,6 +595,7 @@
5 or ``bzr+ssh://`` is now much faster and involves no VFS operations.
6 This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)
7
8+<<<<<<< TREE
9 * ``revision-info`` now properly aligns the revnos/revids in the output
10 and doesn't traceback when given revisions not in the current branch.
11 Performance is also significantly improved when requesting multiple revs
12@@ -603,6 +604,11 @@
13 * Tildes are no longer escaped by Transports. (Andy Kilner)
14
15
16+=======
17+* Selective commit performance on ``2a`` is now better than it is on
18+ ``1.9`` format. (Ian Clatworthy)
19+
20+>>>>>>> MERGE-SOURCE
21 Documentation
22 *************
23
24
25=== modified file 'bzrlib/commit.py'
26--- bzrlib/commit.py 2009-07-15 05:54:37 +0000
27+++ bzrlib/commit.py 2009-08-19 02:39:06 +0000
28@@ -284,12 +284,14 @@
29 # We can use record_iter_changes IFF iter_changes is compatible with
30 # the command line parameters, and the repository has fast delta
31 # generation. See bug 347649.
32- self.use_record_iter_changes = (
33- not self.specific_files and
34- not self.exclude and
35- not self.branch.repository._format.supports_tree_reference and
36- (self.branch.repository._format.fast_deltas or
37- len(self.parents) < 2))
38+ if self.branch.repository._format.supports_tree_reference:
39+ # TODO: fix this
40+ self.use_record_iter_changes = False
41+ else:
42+ self.use_record_iter_changes = \
43+ self.branch.repository._format.fast_deltas
44+ #or (len(self.parents) < 2 and not self.specific_files
45+ #and not self.exclude))
46 self.pb = bzrlib.ui.ui_factory.nested_progress_bar()
47 self.basis_revid = self.work_tree.last_revision()
48 self.basis_tree = self.work_tree.basis_tree()
49@@ -580,10 +582,10 @@
50 except Exception, e:
51 found_exception = e
52 if found_exception is not None:
53- # don't do a plan raise, because the last exception may have been
54+ # don't do a plain raise, because the last exception may have been
55 # trashed, e is our sure-to-work exception even though it loses the
56 # full traceback. XXX: RBC 20060421 perhaps we could check the
57- # exc_info and if its the same one do a plain raise otherwise
58+ # exc_info and if it's the same one do a plain raise otherwise
59 # 'raise e' as we do now.
60 raise e
61
62@@ -618,13 +620,16 @@
63 def _update_builder_with_changes(self):
64 """Update the commit builder with the data about what has changed.
65 """
66- exclude = self.exclude
67- specific_files = self.specific_files or []
68- mutter("Selecting files for commit with filter %s", specific_files)
69+ mutter("Selecting files for commit with filter %s excluding %s",
70+ self.specific_files, self.exclude)
71
72 self._check_strict()
73 if self.use_record_iter_changes:
74 iter_changes = self.work_tree.iter_changes(self.basis_tree)
75+ if self.specific_files or self.exclude:
76+ iter_changes = tree._filter_iter_changes_by_paths(iter_changes,
77+ self.specific_files, self.exclude,
78+ yield_changed_parents=True)
79 iter_changes = self._filter_iter_changes(iter_changes)
80 for file_id, path, fs_hash in self.builder.record_iter_changes(
81 self.work_tree, self.basis_revid, iter_changes):
82@@ -640,7 +645,7 @@
83
84 This method reports on the changes in iter_changes to the user, and
85 converts 'missing' entries in the iter_changes iterator to 'deleted'
86- entries. 'missing' entries have their
87+ entries.
88
89 :param iter_changes: An iter_changes to process.
90 :return: A generator of changes.
91@@ -652,7 +657,6 @@
92 if report_changes:
93 old_path = change[1][0]
94 new_path = change[1][1]
95- versioned = change[3][1]
96 kind = change[6][1]
97 versioned = change[3][1]
98 if kind is None and versioned:
99@@ -941,17 +945,16 @@
100
101 def _set_specific_file_ids(self):
102 """populate self.specific_file_ids if we will use it."""
103- if not self.use_record_iter_changes:
104- # If provided, ensure the specified files are versioned
105- if self.specific_files is not None:
106- # Note: This routine is being called because it raises
107- # PathNotVersionedError as a side effect of finding the IDs. We
108- # later use the ids we found as input to the working tree
109- # inventory iterator, so we only consider those ids rather than
110- # examining the whole tree again.
111- # XXX: Dont we have filter_unversioned to do this more
112- # cheaply?
113- self.specific_file_ids = tree.find_ids_across_trees(
114- self.specific_files, [self.basis_tree, self.work_tree])
115- else:
116- self.specific_file_ids = None
117+ # If provided, ensure the specified files are versioned
118+ if self.specific_files is not None:
119+ # Note: This routine is being called because it raises
120+ # PathNotVersionedError as a side effect of finding the IDs. We
121+ # later use the ids we found as input to the working tree
122+ # inventory iterator, so we only consider those ids rather than
123+ # examining the whole tree again.
124+ # XXX: Dont we have filter_unversioned to do this more
125+ # cheaply?
126+ self.specific_file_ids = tree.find_ids_across_trees(
127+ self.specific_files, [self.basis_tree, self.work_tree])
128+ else:
129+ self.specific_file_ids = None
130
131=== modified file 'bzrlib/repository.py'
132--- bzrlib/repository.py 2009-08-17 23:15:55 +0000
133+++ bzrlib/repository.py 2009-08-19 02:39:06 +0000
134@@ -558,7 +558,7 @@
135 :param iter_changes: An iter_changes iterator with the changes to apply
136 to basis_revision_id. The iterator must not include any items with
137 a current kind of None - missing items must be either filtered out
138- or errored-on beefore record_iter_changes sees the item.
139+ or errored-on before record_iter_changes sees the item.
140 :param _entry_factory: Private method to bind entry_factory locally for
141 performance.
142 :return: A generator of (file_id, relpath, fs_hash) tuples for use with
143
144=== modified file 'bzrlib/tests/test_tree.py'
145--- bzrlib/tests/test_tree.py 2009-03-23 14:59:43 +0000
146+++ bzrlib/tests/test_tree.py 2009-08-19 02:39:06 +0000
147@@ -23,7 +23,7 @@
148 tree as _mod_tree,
149 )
150 from bzrlib.tests import TestCaseWithTransport
151-from bzrlib.tree import InterTree
152+from bzrlib.tree import InterTree, _filter_iter_changes_by_paths
153
154
155 class TestInterTree(TestCaseWithTransport):
156@@ -417,3 +417,84 @@
157 self.assertPathToKey(([u''], u'a'), u'a')
158 self.assertPathToKey(([u'a'], u'b'), u'a/b')
159 self.assertPathToKey(([u'a', u'b'], u'c'), u'a/b/c')
160+
161+
162+class TestFilterIterChangesByPaths(TestCaseWithTransport):
163+ """Tests for tree._filter_iter_changes_by_paths()."""
164+
165+ def make_changes_iter(self):
166+ wt = self.make_branch_and_tree('.')
167+ b = wt.branch
168+ self.build_tree(['foo/', 'foo/foo1', 'bar/', 'bar/bar1', 'bar/bar2',
169+ 'baz'])
170+ wt.add(['foo', 'foo/foo1', 'bar', 'bar/bar1', 'bar/bar2', 'baz'],
171+ ['foo-id', 'foo1-id', 'bar-id', 'bar1-id', 'bar2-id', 'baz-id'])
172+ wt.commit('bar/bar1', specific_files=['bar/bar1'], rev_id='1')
173+ basis = wt.basis_tree()
174+ wt.lock_read()
175+ self.addCleanup(wt.unlock)
176+ basis.lock_read()
177+ self.addCleanup(basis.unlock)
178+ return wt.iter_changes(basis)
179+
180+ def test_no_filtering(self):
181+ original = self.make_changes_iter()
182+ original_list = list(original)
183+ filtered = _filter_iter_changes_by_paths(iter(original_list))
184+ self.assertEqual(original_list, list(filtered))
185+
186+ def test_include_files(self):
187+ original = self.make_changes_iter()
188+ filtered = _filter_iter_changes_by_paths(original,
189+ include_files=['bar'])
190+ self.assertEqual([
191+ ('bar2-id', (None, u'bar/bar2'), True, (False, True),
192+ (None, 'bar-id'), (None, u'bar2'), (None, 'file'), (None, 0)),
193+ ], list(filtered))
194+
195+ def test_include_files_yielding_changed_parents(self):
196+ original = self.make_changes_iter()
197+ filtered = _filter_iter_changes_by_paths(original,
198+ include_files=['bar'], yield_changed_parents=True)
199+ self.assertEqual([
200+ ('bar2-id', (None, u'bar/bar2'), True, (False, True),
201+ (None, 'bar-id'), (None, u'bar2'), (None, 'file'), (None, 0)),
202+ ], list(filtered))
203+
204+ def test_include_files_yielding_changed_parents2(self):
205+ original = self.make_changes_iter()
206+ filtered = _filter_iter_changes_by_paths(original,
207+ include_files=['foo'], yield_changed_parents=True)
208+ self.assertEqual([
209+ ('foo-id', (None, u'foo'), True, (False, True),
210+ (None, 'TREE_ROOT'), (None, u'foo'), (None, 'directory'), (None, 0)),
211+ ('foo1-id', (None, u'foo/foo1'), True, (False, True),
212+ (None, 'foo-id'), (None, u'foo1'), (None, 'file'), (None, 0)),
213+ ], list(filtered))
214+
215+ def test_exclude_files(self):
216+ original = self.make_changes_iter()
217+ filtered = _filter_iter_changes_by_paths(original,
218+ exclude_files=['foo'])
219+ self.assertEqual([
220+ ('baz-id', (None, u'baz'), True, (False, True),
221+ (None, 'TREE_ROOT'), (None, u'baz'), (None, 'file'), (None, 0)),
222+ ('bar2-id', (None, u'bar/bar2'), True, (False, True),
223+ (None, 'bar-id'), (None, u'bar2'), (None, 'file'), (None, 0)),
224+ ], list(filtered))
225+ pass
226+
227+ def test_excludes_override_includes(self):
228+ original = self.make_changes_iter()
229+ filtered = _filter_iter_changes_by_paths(original,
230+ include_files=['bar/bar2'], exclude_files=['bar'])
231+ self.assertEqual([], list(filtered))
232+
233+ def test_excludes_override_includes2(self):
234+ original = self.make_changes_iter()
235+ filtered = _filter_iter_changes_by_paths(original,
236+ include_files=['foo'], exclude_files=['foo/foo1'])
237+ self.assertEqual([
238+ ('foo-id', (None, u'foo'), True, (False, True),
239+ (None, 'TREE_ROOT'), (None, u'foo'), (None, 'directory'), (None, 0)),
240+ ], list(filtered))
241
242=== modified file 'bzrlib/tree.py'
243--- bzrlib/tree.py 2009-07-17 06:04:35 +0000
244+++ bzrlib/tree.py 2009-08-19 02:39:06 +0000
245@@ -35,7 +35,7 @@
246 from bzrlib import errors
247 from bzrlib.inventory import InventoryFile
248 from bzrlib.inter import InterObject
249-from bzrlib.osutils import fingerprint_file
250+from bzrlib.osutils import fingerprint_file, is_inside_any
251 import bzrlib.revision
252 from bzrlib.symbol_versioning import deprecated_function, deprecated_in
253 from bzrlib.trace import note
254@@ -1289,3 +1289,59 @@
255 other_values.append(self._lookup_by_file_id(
256 alt_extra, alt_tree, file_id))
257 yield other_path, file_id, None, other_values
258+
259+
260+def _filter_iter_changes_by_paths(changes, include_files=None,
261+ exclude_files=None, yield_changed_parents=False):
262+ """Filter the results from iter_changes using paths.
263+
264+ This decorator is useful for post-processing the output from
265+ Tree.iter_changes(). It may also be used for post-procesing the output
266+ from result-compatible methods such as Inventory.iter_changes() and
267+ PreviewTree.iter_changes(). Note that specific-path filtering should not
268+ have already been applied.
269+
270+ :param changes: the iterator of changes to filter
271+ :param include_files: paths of files and directories to include or None
272+ for no masking.
273+ :param exclude_files: paths of files and directories to exclude or None
274+ for no masking. Excludes take precedence over includes.
275+ :param yield_changed_parents: if True, include changed parent directories
276+ of included paths.
277+ """
278+ if not (include_files or exclude_files):
279+ for change in changes:
280+ yield change
281+
282+ # Find the sets of files to include, additional parents and excludes,
283+ # always including the root in the parents
284+ includes_parents = set([''])
285+ if include_files:
286+ include_set = osutils.minimum_path_selection(include_files)
287+ if yield_changed_parents:
288+ for include in include_set:
289+ for parent in osutils.parent_directories(include):
290+ if parent not in include_set:
291+ includes_parents.add(parent)
292+ else:
293+ include_set = set([''])
294+ if exclude_files:
295+ exclude_set = osutils.minimum_path_selection(exclude_files)
296+ else:
297+ exclude_set = set([])
298+
299+ for change in changes:
300+ # Decide which path to use
301+ old_path, new_path = change[1]
302+ if new_path is None:
303+ path = old_path
304+ else:
305+ path = new_path
306+
307+ # Do the filtering
308+ if exclude_set and is_inside_any(exclude_set, path):
309+ continue
310+ elif is_inside_any(include_set, path):
311+ yield change
312+ elif yield_changed_parents and path in includes_parents:
313+ yield change