Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 392 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Reviewer Review Type Date Requested Status
John A Meinel Needs Information
Andrew Bennetts Approve
Review via email: mp+6861@code.launchpad.net

This proposal supersedes a proposal from 2009-05-28.

This proposal has been superseded by a proposal from 2009-06-17.

To post a comment you must log in.
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.

More broadly, I think the important thing to be begin recording the changes as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and built from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)

Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.

We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.

A guideline is: "Don't save if it would cost as much data as you avoid
reading".

The cost of not updating the dirstate is having to re-read the file
who's sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.

If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoepA4ACgkQJdeBCYSNAANrVwCgwjbMhA60vNrtlpbciGdFw9pp
Y6MAn1FfgUIcKOSm5BjgXP5EjKJtbaVS
=i1ML
-----END PGP SIGNATURE-----

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> who's sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.

It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logical
readonly operations.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

This updated merge proposal makes it configurable as to how many changes are ok before the dirstate should be saved. As Martin requested, there's the ability to turn off saving altogether. After this lands, we can use that to implement various policies (e.g. don't save on a logical read-only operation) as we see fit.

This drops 'bzr st [file]' down by 0.5+ seconds on OOo. Nice. It doesn't yet make a difference for add or commit. Those are small follow-on patches after we're happy with this one.

Revision history for this message
Andrew Bennetts (spiv) wrote :

I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.

Overall, I this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)

It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.

+ :param worth_saving_limit: when the exact number of changed entries
+ is known, only bother saving the dirstate if more than this
+ count of entries have changed. -1 means never save.

I'd personally prefer "None means never save", but that's not a big deal.

I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.

I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.

+ # We only enable smart saving is it hasn't already been disabled

"is" -> "if"

+ if self._use_smart_saving is not False:
+ self._use_smart_saving = True

So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?

(If you really are just testing for True/False, it's generally a bad idea to test for True or False with "is". Use "if [not] self._use_smart_saving:" instead, which is also easier to read.)

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Review: Approve
> I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.
>
> Overall, I this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file.

I'm a bit surprised that re-reading the file is significantly faster
than converting a tuple of strings back into a serialized string. If it
is significant, than I'd certainly rather a pyrex extension than reading
from disk...

>
> + :param worth_saving_limit: when the exact number of changed entries
> + is known, only bother saving the dirstate if more than this
> + count of entries have changed. -1 means never save.
>
> I'd personally prefer "None means never save", but that's not a big deal.
>
> I believe the dirstate can record changes to tree state, not just cache stat info...
> so is there a risk with this patch that e.g. "bzr mv foo bar" will go
unrecorded because it's under the _worth_saving_limit? I understand the
value in not updating the cache for only a small number of changes, but
AFAIK we always want to update the dirstate file when actual state is
changed. I think the header_too=True flag is used to make sure this
works, but that seems like a pretty indirect way of expressing the
concept that this is a non-optional update.

The *big* think to watch out for is that there is "IN_MEMORY_MODIFIED"
that happens from a stat cache update, which can be trivially ignored.
And there is "IN_MEMORY_MODIFIED" that happens from "bzr add" which
*must* be written out.

I didn't look at the code to see if Ian clearly distinguished them, if
he has, then we are fine.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoogDcACgkQJdeBCYSNAAPFWQCfeAnQLM2UPq6K+BVsleEy+szw
UIsAoKeT7L7Jybvz8vHUZI32kw9XSmNP
=SWQG
-----END PGP SIGNATURE-----

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Andrew Bennetts wrote:
> Review: Approve
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.
>
>
Andrew,

Thanks for the review!

I did carefully benchmark a few approaches here and, as discussed on
IRC, not doing anything unless we absolutely have to gives the best
result for status. For other operations (that don't leverage the ability
to track some changes yet), the best time to read is probably just
before we mutate the dirstate. FWIW, profiling showed file reading time
to be a small factor - it was rebuilding the keys in the right order
that mattered. Before mutation, we simply grab them straight out of the
dirstate. After then, we need to rebuild them from the file content.

> I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.
>
>
With this patch, there's no risk here. When follow-on patches arrive for
operations other than status, I'll need to handle those sorts of cases.

> I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.
>
>
I tried this during development. Some low level tests fail that "make
change then read disk content". Otherwise, it looked ok to me.

> + # We only enable smart saving is it hasn't already been disabled
>
> "is" -> "if"
>
> + if self._use_smart_saving is not False:
> + self._use_smart_saving = True
>
> So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?
>
>
As discussed on IRC, I'll make this "if xxx is None".

Once again, thanks for the review.

Ian C.

Revision history for this message
John A Meinel (jameinel) wrote :

I'd just like to make it clear that I'm "hesitant" on this patch. I'll admit I haven't looked at it closely, but it seems a bit along the lines of a band-aid for now sort of thing.

Especially given the complexity of keeping multiple inventories in sync...

221 + if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
222 + keys = []
223 + for directory in self._dirblocks:
224 + keys.extend([e[0] for e in directory[1]])
225 + else:

What is the use case for calling _build_line_cache when the current state is IN_MEMORY_UNMODIFIED? Certainly save() is already defined as a no-op in that case. I'd rather avoid having extra code paths that aren't really being used.
I guess maybe it is if the header changes and the blocks don't...

I'll just mention that if you look over the code, the only time _header_state = ...MODIFIED is when _dirblock_state = ... MODIFIED

The only reason it has a separate state flag, is because you can read the header without reading the rest of the dirblocks. (At least to date, certainly the knob exists for that to not be true.)
Though if you consider what is in the header:

1) num parents / parent ids (if this changes, the dirblock state certainly should)
2) ghost parents (see 1)
3) crc (Also indicative of actual content changes)
4) num_entries (if this changes, we better have dirblock changes)

review: Needs Information
Revision history for this message
Robert Collins (lifeless) wrote :

re: safety.

Can I propose that we have an explicit API for just updating the hash
cache (we may already). And that the current state of MODIFIED *always*
refer to having semantic changes rather than hash cache updates.

Then, the hash cache updating code should not alter the dirstate state.
Rather it should just accumulate into a 'hash cache changes counter'.
And save can check that.

This may be very similar to the code. The key thing is that I hope this
will structure things so as to avoid needing to be careful in commands.

-Rob

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/_dirstate_helpers_c.pyx'
2--- bzrlib/_dirstate_helpers_c.pyx 2009-06-09 01:52:10 +0000
3+++ bzrlib/_dirstate_helpers_c.pyx 2009-06-16 02:37:03 +0000
4@@ -909,7 +909,7 @@
5 else:
6 entry[1][0] = ('l', '', stat_value.st_size,
7 False, DirState.NULLSTAT)
8- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
9+ self._mark_modified([entry])
10 return link_or_sha1
11
12
13
14=== modified file 'bzrlib/dirstate.py'
15--- bzrlib/dirstate.py 2009-06-01 08:55:17 +0000
16+++ bzrlib/dirstate.py 2009-06-16 02:37:03 +0000
17@@ -362,11 +362,14 @@
18 HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
19 HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'
20
21- def __init__(self, path, sha1_provider):
22+ def __init__(self, path, sha1_provider, worth_saving_limit=0):
23 """Create a DirState object.
24
25 :param path: The path at which the dirstate file on disk should live.
26 :param sha1_provider: an object meeting the SHA1Provider interface.
27+ :param worth_saving_limit: when the exact number of changed entries
28+ is known, only bother saving the dirstate if more than this
29+ count of entries have changed. -1 means never save.
30 """
31 # _header_state and _dirblock_state represent the current state
32 # of the dirstate metadata and the per-row data respectiely.
33@@ -409,11 +412,48 @@
34 # during commit.
35 self._last_block_index = None
36 self._last_entry_index = None
37+ # If True, use the per-entry field cache for faster serialisation.
38+ # If False, disable it. If None, it is not used but may be enabled.
39+ self._use_smart_saving = None
40+ # The set of known changes
41+ self._known_changes = set()
42+ # The cache of serialised lines. When built, this is a tuple of
43+ # 2 sorted lists that we "walk" while serialising.
44+ self._line_cache = None
45+ # How many changed entries can we have without saving
46+ self._worth_saving_limit = worth_saving_limit
47
48 def __repr__(self):
49 return "%s(%r)" % \
50 (self.__class__.__name__, self._filename)
51
52+ def _mark_modified(self, entries=None, header_too=False):
53+ """Mark this dirstate as modified.
54+
55+ :param entries: if non-None, mark just these entries as modified.
56+ :param header_too: mark the header modified as well, not just the
57+ dirblocks.
58+ """
59+ #trace.mutter_callsite(3, "modified entries: %s", entries)
60+ if entries:
61+ self._known_changes.update([e[0] for e in entries])
62+ # We only enable smart saving is it hasn't already been disabled
63+ if self._use_smart_saving is not False:
64+ self._use_smart_saving = True
65+ else:
66+ # We don't know exactly what changed so disable smart saving
67+ self._use_smart_saving = False
68+ self._dirblock_state = DirState.IN_MEMORY_MODIFIED
69+ if header_too:
70+ self._header_state = DirState.IN_MEMORY_MODIFIED
71+
72+ def _mark_unmodified(self):
73+ """Mark this dirstate as unmodified."""
74+ self._header_state = DirState.IN_MEMORY_UNMODIFIED
75+ self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
76+ self._use_smart_saving = None
77+ self._known_changes = set()
78+
79 def add(self, path, file_id, kind, stat, fingerprint):
80 """Add a path to be tracked.
81
82@@ -545,7 +585,7 @@
83 if kind == 'directory':
84 # insert a new dirblock
85 self._ensure_block(block_index, entry_index, utf8path)
86- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
87+ self._mark_modified()
88 if self._id_index:
89 self._id_index.setdefault(entry_key[2], set()).add(entry_key)
90
91@@ -1017,8 +1057,7 @@
92
93 self._ghosts = []
94 self._parents = [parents[0]]
95- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
96- self._header_state = DirState.IN_MEMORY_MODIFIED
97+ self._mark_modified(header_too=True)
98
99 def _empty_parent_info(self):
100 return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
101@@ -1460,8 +1499,7 @@
102 # Apply in-situ changes.
103 self._update_basis_apply_changes(changes)
104
105- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
106- self._header_state = DirState.IN_MEMORY_MODIFIED
107+ self._mark_modified(header_too=True)
108 self._id_index = None
109 return
110
111@@ -1594,7 +1632,7 @@
112 and stat_value.st_ctime < self._cutoff_time):
113 entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
114 packed_stat)
115- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
116+ self._mark_modified()
117
118 def _sha_cutoff_time(self):
119 """Return cutoff time.
120@@ -1658,14 +1696,13 @@
121 """Serialise the entire dirstate to a sequence of lines."""
122 if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
123 self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
124- # read whats on disk.
125+ # read what's on disk.
126 self._state_file.seek(0)
127 return self._state_file.readlines()
128 lines = []
129 lines.append(self._get_parents_line(self.get_parent_ids()))
130 lines.append(self._get_ghosts_line(self._ghosts))
131- # append the root line which is special cased
132- lines.extend(map(self._entry_to_line, self._iter_entries()))
133+ lines.extend(self._get_entry_lines())
134 return self._get_output_lines(lines)
135
136 def _get_ghosts_line(self, ghost_ids):
137@@ -1676,6 +1713,37 @@
138 """Create a line for the state file for parents information."""
139 return '\0'.join([str(len(parent_ids))] + parent_ids)
140
141+ def _get_entry_lines(self):
142+ """Create lines for entries."""
143+ if self._use_smart_saving:
144+ # Build the line cache if it hasn't already been built
145+ self._build_line_cache()
146+ # We unroll this case for better performance ...
147+ # The line cache is a tuple of 2 ordered lists: keys and lines.
148+ # We keep track of successful matches and only search from there
149+ # on next time.
150+ entry_to_line = self._entry_to_line
151+ known_changes = self._known_changes
152+ index = 0
153+ keys, serialised = self._line_cache
154+ result = []
155+ for entry in self._iter_entries():
156+ key = entry[0]
157+ if key in known_changes:
158+ result.append(entry_to_line(entry))
159+ else:
160+ if keys[index] != key:
161+ try:
162+ index = keys.index(key, index + 1)
163+ except ValueError:
164+ result.append(entry_to_line(entry))
165+ continue
166+ result.append(serialised[index])
167+ index += 1
168+ return result
169+ else:
170+ return map(self._entry_to_line, self._iter_entries())
171+
172 def _get_fields_to_entry(self):
173 """Get a function which converts entry fields into a entry record.
174
175@@ -2034,17 +2102,21 @@
176 return len(self._parents) - len(self._ghosts)
177
178 @staticmethod
179- def on_file(path, sha1_provider=None):
180+ def on_file(path, sha1_provider=None, worth_saving_limit=0):
181 """Construct a DirState on the file at path "path".
182
183 :param path: The path at which the dirstate file on disk should live.
184 :param sha1_provider: an object meeting the SHA1Provider interface.
185 If None, a DefaultSHA1Provider is used.
186+ :param worth_saving_limit: when the exact number of changed entries
187+ is known, only bother saving the dirstate if more than this
188+ count of entries have changed. -1 means never save.
189 :return: An unlocked DirState object, associated with the given path.
190 """
191 if sha1_provider is None:
192 sha1_provider = DefaultSHA1Provider()
193- result = DirState(path, sha1_provider)
194+ result = DirState(path, sha1_provider,
195+ worth_saving_limit=worth_saving_limit)
196 return result
197
198 def _read_dirblocks_if_needed(self):
199@@ -2058,6 +2130,36 @@
200 if self._dirblock_state == DirState.NOT_IN_MEMORY:
201 _read_dirblocks(self)
202
203+ def _build_line_cache(self):
204+ """Build the line cache.
205+
206+ The line cache maps entry keys to serialised lines via
207+ a tuple of 2 sorted lists.
208+ """
209+ if self._line_cache is not None:
210+ # already built
211+ return
212+ self._state_file.seek(0)
213+ lines = self._state_file.readlines()
214+ # There are 5 header lines: 3 in the prelude, a line for
215+ # parents and a line for ghosts. There is also a trailing
216+ # empty line. We skip over those.
217+ # Each line starts with a null and ends with a null and
218+ # newline. We don't keep those because the serialisation
219+ # process adds them.
220+ values = [l[1:-2] for l in lines[5:-1]]
221+ if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
222+ keys = []
223+ for directory in self._dirblocks:
224+ keys.extend([e[0] for e in directory[1]])
225+ else:
226+ # Be safe and calculate the keys from the lines
227+ keys = []
228+ for v in values:
229+ fields = v.split('\0', 3)
230+ keys.append((fields[0], fields[1], fields[2]))
231+ self._line_cache = (keys, values)
232+
233 def _read_header(self):
234 """This reads in the metadata header, and the parent ids.
235
236@@ -2142,9 +2244,7 @@
237 trace.mutter('Not saving DirState because '
238 '_changes_aborted is set.')
239 return
240- if (self._header_state == DirState.IN_MEMORY_MODIFIED or
241- self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
242-
243+ if self._worth_saving():
244 grabbed_write_lock = False
245 if self._lock_state != 'w':
246 grabbed_write_lock, new_lock = self._lock_token.temporary_write_lock()
247@@ -2158,12 +2258,12 @@
248 # We couldn't grab a write lock, so we switch back to a read one
249 return
250 try:
251+ lines = self.get_lines()
252 self._state_file.seek(0)
253- self._state_file.writelines(self.get_lines())
254+ self._state_file.writelines(lines)
255 self._state_file.truncate()
256 self._state_file.flush()
257- self._header_state = DirState.IN_MEMORY_UNMODIFIED
258- self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
259+ self._mark_unmodified()
260 finally:
261 if grabbed_write_lock:
262 self._lock_token = self._lock_token.restore_read_lock()
263@@ -2172,6 +2272,26 @@
264 # not changed contents. Since restore_read_lock may
265 # not be an atomic operation.
266
267+ def _worth_saving(self):
268+ """Is it worth saving the dirstate or not?"""
269+ # Header changes are probably important enough to always save
270+ if self._header_state == DirState.IN_MEMORY_MODIFIED:
271+ return True
272+ if (self._dirblock_state == DirState.IN_MEMORY_MODIFIED and
273+ self._worth_saving_limit != -1):
274+ # If we're using smart saving and only a small number of
275+ # entries have changed, don't bother saving. John has
276+ # suggested using a heuristic here based on the size of the
277+ # changed files and/or tree. For now, we go with a configurable
278+ # number of changes, keeping the calculation time
279+ # as low overhead as possible. (This also keeps all existing
280+ # tests passing as the default is 0, i.e. always save.)
281+ if self._use_smart_saving:
282+ return len(self._known_changes) > self._worth_saving_limit
283+ else:
284+ return True
285+ return False
286+
287 def _set_data(self, parent_ids, dirblocks):
288 """Set the full dirstate data in memory.
289
290@@ -2185,8 +2305,7 @@
291 """
292 # our memory copy is now authoritative.
293 self._dirblocks = dirblocks
294- self._header_state = DirState.IN_MEMORY_MODIFIED
295- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
296+ self._mark_modified(header_too=True)
297 self._parents = list(parent_ids)
298 self._id_index = None
299 self._packed_stat_index = None
300@@ -2212,7 +2331,7 @@
301 self._make_absent(entry)
302 self.update_minimal(('', '', new_id), 'd',
303 path_utf8='', packed_stat=entry[1][0][4])
304- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
305+ self._mark_modified()
306 if self._id_index is not None:
307 self._id_index.setdefault(new_id, set()).add(entry[0])
308
309@@ -2352,8 +2471,7 @@
310 self._entries_to_current_state(new_entries)
311 self._parents = [rev_id for rev_id, tree in trees]
312 self._ghosts = list(ghosts)
313- self._header_state = DirState.IN_MEMORY_MODIFIED
314- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
315+ self._mark_modified(header_too=True)
316 self._id_index = id_index
317
318 def _sort_entries(self, entry_list):
319@@ -2471,7 +2589,7 @@
320 # without seeing it in the new list. so it must be gone.
321 self._make_absent(current_old)
322 current_old = advance(old_iterator)
323- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
324+ self._mark_modified()
325 self._id_index = None
326 self._packed_stat_index = None
327
328@@ -2524,7 +2642,7 @@
329 if update_tree_details[0][0] == 'a': # absent
330 raise AssertionError('bad row %r' % (update_tree_details,))
331 update_tree_details[0] = DirState.NULL_PARENT_DETAILS
332- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
333+ self._mark_modified()
334 return last_reference
335
336 def update_minimal(self, key, minikind, executable=False, fingerprint='',
337@@ -2650,7 +2768,7 @@
338 if not present:
339 self._dirblocks.insert(block_index, (subdir_key[0], []))
340
341- self._dirblock_state = DirState.IN_MEMORY_MODIFIED
342+ self._mark_modified()
343
344 def _validate(self):
345 """Check that invariants on the dirblock are correct.
346@@ -2936,7 +3054,7 @@
347 else:
348 entry[1][0] = ('l', '', stat_value.st_size,
349 False, DirState.NULLSTAT)
350- state._dirblock_state = DirState.IN_MEMORY_MODIFIED
351+ state._mark_modified([entry])
352 return link_or_sha1
353 update_entry = py_update_entry
354
355
356=== modified file 'bzrlib/workingtree_4.py'
357--- bzrlib/workingtree_4.py 2009-06-11 04:23:53 +0000
358+++ bzrlib/workingtree_4.py 2009-06-16 02:37:03 +0000
359@@ -248,7 +248,7 @@
360 local_path = self.bzrdir.get_workingtree_transport(None
361 ).local_abspath('dirstate')
362 self._dirstate = dirstate.DirState.on_file(local_path,
363- self._sha1_provider())
364+ self._sha1_provider(), self._worth_saving_limit())
365 return self._dirstate
366
367 def _sha1_provider(self):
368@@ -263,6 +263,15 @@
369 else:
370 return None
371
372+ def _worth_saving_limit(self):
373+ """How many changes are ok before we must save the dirstate.
374+
375+ :return: an integer. -1 means never save.
376+ """
377+ # XXX: In the future, we could return -1 for logical read-only
378+ # operations like status. For now, just use a small number.
379+ return 10
380+
381 def filter_unversioned_files(self, paths):
382 """Filter out paths that are versioned.
383
384@@ -863,7 +872,7 @@
385 rollback_rename()
386 raise
387 result.append((from_rel, to_rel))
388- state._dirblock_state = dirstate.DirState.IN_MEMORY_MODIFIED
389+ state._mark_modified()
390 self._make_dirty(reset_inventory=False)
391
392 return result