Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Rejected
Rejected by: John A Meinel
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 351 lines (has conflicts)
Text conflict in NEWS
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Reviewer Review Type Date Requested Status
Martin Pool Needs Fixing
Review via email: mp+7537@code.launchpad.net

This proposal supersedes a proposal from 2009-05-28.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
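The core idea can be sketched as follows. This is a hypothetical illustration (class and method names like `EntryCache` are invented here, not bzrlib API): keep the serialised line for each entry, and re-serialise only entries explicitly marked as changed.

```python
class EntryCache(object):
    """Sketch of the idea: remember the serialised form of every entry
    and re-serialise only the entries known to have changed."""

    def __init__(self, entries):
        # entries: dict mapping entry key -> entry value
        self._entries = entries
        self._serialised = {}   # key -> cached serialised line
        self._changed = set()   # keys whose cached line is stale

    def mark_changed(self, key):
        self._changed.add(key)

    def _serialise(self, key):
        # stand-in for the real (expensive) entry serialiser
        return '%s\0%s\n' % (key, self._entries[key])

    def get_lines(self):
        lines = []
        for key in sorted(self._entries):
            # only pay the serialisation cost for changed/unseen entries
            if key in self._changed or key not in self._serialised:
                self._serialised[key] = self._serialise(key)
            lines.append(self._serialised[key])
        self._changed.clear()
        return lines
```

With this shape, a save after touching one file re-serialises one entry instead of the whole tree, which is where the O(size-of-tree) cost goes away.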

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.
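The 'overlay' alternative could look roughly like this. Again a sketch with invented names (`serialise`, `save_with_overlay`) and not the real dirstate format: destructive operations (delete/rename) trigger a full clean write, while adds and changes just append.

```python
import os
import tempfile

def serialise(entry):
    # stand-in for the real entry-to-line serialiser
    return '\0'.join(entry) + '\n'

def save_with_overlay(path, all_entries, changed_entries, destructive):
    """Sketch: adds and changes append their entries to the dirstate
    file; deletes or renames (destructive=True) force a full rewrite."""
    if destructive:
        with open(path, 'w') as f:
            f.writelines(serialise(e) for e in all_entries)
    else:
        with open(path, 'a') as f:
            f.writelines(serialise(e) for e in changed_entries)

path = os.path.join(tempfile.mkdtemp(), 'dirstate')
save_with_overlay(path, [('a', '1'), ('b', '2')], [], True)   # full write
save_with_overlay(path, [], [('c', '3')], False)              # cheap append
with open(path) as f:
    content = f.read()
```

The reader would then need to apply later overlay entries over earlier ones, which is the non-trivial part alluded to above.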

More broadly, I think the important thing is to begin recording the changes as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)

Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 seconds.

We just need to agree on the number of changes below which it's not worth saving. I suggest we save for 3 or more changes - i.e. we skip the save iff 2 or fewer files are changed. If someone has a better number, let me know.
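The rule being proposed reduces to a one-line check (a sketch; the helper name is hypothetical, not the patch's API):

```python
def worth_saving(num_changed, limit=2):
    """Save iff more entries changed than the limit; with limit=2 we
    skip the write when only one or two files changed."""
    return num_changed > limit
```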

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.

A guideline is: "Don't save if it would cost as much data as you avoid
reading".

The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.

If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.

John
=:->
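John's guideline reads as a byte-cost comparison. A sketch (the function name is invented; where the sizes come from, and the exact threshold, are the open questions):

```python
def worth_saving_by_cost(dirstate_size, changed_file_sizes):
    """Save only when the bytes we would otherwise have to re-read
    later (re-sha'ing the files whose cached stat info is stale)
    exceed the bytes written by rewriting the whole dirstate file."""
    return sum(changed_file_sizes) > dirstate_size
```

Under this rule, ten 10k files against a 1MB dirstate would not be worth saving, while a handful of very large changed files would be.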

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> whose sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.

It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logical
readonly operations.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

This updated merge proposal makes the number of changes tolerated before the dirstate is saved configurable. As Martin requested, there's the ability to turn off saving altogether. After this lands, we can use that to implement various policies (e.g. don't save on a logical read-only operation) as we see fit.

This drops 'bzr st [file]' down by 0.5+ seconds on OOo. Nice. It doesn't yet make a difference for add or commit. Those are small follow-on patches after we're happy with this one.

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.

Overall, this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)

It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate, re-reading from the mmap rather than keeping a copy would clearly make sense.

+ :param worth_saving_limit: when the exact number of changed entries
+ is known, only bother saving the dirstate if more than this
+ count of entries have changed. -1 means never save.

I'd personally prefer "None means never save", but that's not a big deal.

I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.

I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.

+ # We only enable smart saving is it hasn't already been disabled

"is" -> "if"

+ if self._use_smart_saving is not False:
+ self._use_smart_saving = True

So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?

(If you really are just testing for True/False, it's generally a bad idea to test for True or False with "is". Use "if [not] self._use_smart_saving:" instead, which is also easier to read.)

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> Review: Approve
> I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.
>
> Overall, this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file.

I'm a bit surprised that re-reading the file is significantly faster
than converting a tuple of strings back into a serialized string. If it
is significant, then I'd certainly rather have a pyrex extension than
reading from disk...

>
> + :param worth_saving_limit: when the exact number of changed entries
> + is known, only bother saving the dirstate if more than this
> + count of entries have changed. -1 means never save.
>
> I'd personally prefer "None means never save", but that's not a big deal.
>
> I believe the dirstate can record changes to tree state, not just cache stat info...
> so is there a risk with this patch that e.g. "bzr mv foo bar" will go
> unrecorded because it's under the _worth_saving_limit? I understand the
> value in not updating the cache for only a small number of changes, but
> AFAIK we always want to update the dirstate file when actual state is
> changed. I think the header_too=True flag is used to make sure this
> works, but that seems like a pretty indirect way of expressing the
> concept that this is a non-optional update.

The *big* thing to watch out for is that there is "IN_MEMORY_MODIFIED"
that happens from a stat cache update, which can be trivially ignored.
And there is "IN_MEMORY_MODIFIED" that happens from "bzr add" which
*must* be written out.

I didn't look at the code to see if Ian clearly distinguished them; if
he has, then we are fine.

John
=:->

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> Review: Approve
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.
>
>
Andrew,

Thanks for the review!

I did carefully benchmark a few approaches here and, as discussed on
IRC, not doing anything unless we absolutely have to gives the best
result for status. For other operations (that don't leverage the ability
to track some changes yet), the best time to read is probably just
before we mutate the dirstate. FWIW, profiling showed file reading time
to be a small factor - it was rebuilding the keys in the right order
that mattered. Before mutation, we simply grab them straight out of the
dirstate. After that, we need to rebuild them from the file content.

> I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.
>
>
With this patch, there's no risk here. When follow-on patches arrive for
operations other than status, I'll need to handle those sorts of cases.

> I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.
>
>
I tried this during development. Some low level tests that "make a
change then read the disk content" fail. Otherwise, it looked ok to me.

> + # We only enable smart saving is it hasn't already been disabled
>
> "is" -> "if"
>
> + if self._use_smart_saving is not False:
> + self._use_smart_saving = True
>
> So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?
>
>
As discussed on IRC, I'll make this "if xxx is None".

Once again, thanks for the review.

Ian C.

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

I'd just like to make it clear that I'm "hesitant" about this patch. I'll admit I haven't looked at it closely, but it seems a bit like a band-aid-for-now sort of thing.

Especially given the complexity of keeping multiple inventories in sync...

221 + if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
222 + keys = []
223 + for directory in self._dirblocks:
224 + keys.extend([e[0] for e in directory[1]])
225 + else:

What is the use case for calling _build_line_cache when the current state is IN_MEMORY_UNMODIFIED? Certainly save() is already defined as a no-op in that case. I'd rather avoid having extra code paths that aren't really being used.
I guess maybe it is if the header changes and the blocks don't...

I'll just mention that if you look over the code, the only time _header_state = ...MODIFIED is when _dirblock_state = ... MODIFIED

The only reason it has a separate state flag, is because you can read the header without reading the rest of the dirblocks. (At least to date, certainly the knob exists for that to not be true.)
Though if you consider what is in the header:

1) num parents / parent ids (if this changes, the dirblock state certainly should)
2) ghost parents (see 1)
3) crc (Also indicative of actual content changes)
4) num_entries (if this changes, we better have dirblock changes)

review: Needs Information
Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

re: safety.

Can I propose that we have an explicit API for just updating the hash
cache (we may already). And that the current state of MODIFIED *always*
refer to having semantic changes rather than hash cache updates.

Then, the hash cache updating code should not alter the dirstate state.
Rather it should just accumulate into a 'hash cache changes counter'.
And save can check that.

This may be very similar to the code. The key thing is that I hope this
will structure things so as to avoid needing to be careful in commands.

-Rob
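Robert's structure can be sketched like this (hypothetical names throughout, not the patch's code): hash-cache updates go through their own method and only bump a counter, so the MODIFIED state always means a semantic change that must be written out.

```python
class DirStateSketch(object):
    """Sketch of the proposed split: semantic changes vs hash-cache updates."""

    IN_MEMORY_UNMODIFIED = 0
    IN_MEMORY_MODIFIED = 1

    def __init__(self, worth_saving_limit=2):
        self._state = self.IN_MEMORY_UNMODIFIED
        self._hash_cache_changes = 0
        self._worth_saving_limit = worth_saving_limit

    def mark_semantic_change(self):
        # e.g. add, rename, parent change: must always be written out
        self._state = self.IN_MEMORY_MODIFIED

    def update_hash_cache(self):
        # stat-cache refresh only: never touches the MODIFIED flag,
        # just accumulates into a counter that save() can consult
        self._hash_cache_changes += 1

    def worth_saving(self):
        return (self._state == self.IN_MEMORY_MODIFIED or
                self._hash_cache_changes > self._worth_saving_limit)
```

The point of the design is that callers never have to be careful: semantic mutations cannot be accidentally skipped by the worth-saving heuristic, because the heuristic only ever applies to the counter.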

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Thanks to Andrew, John and Robert for their feedback on my earlier merge proposal. This updated one takes their comments into account, particularly the bit about optional vs mandatory (hash change vs semantically important) updating. It also drops the "line cache" as there are better ways (e.g. pyrex) of solving serialisation speed (which we can look into later).

As before, this improves 'bzr status' performance on OpenOffice.org by 0.5 seconds after changing a small number of files. As mentioned in NEWS, full status drops from 1.6 to 1.1 seconds while status on just a FILE drops from 1.0 to 0.5 seconds.

Revision history for this message
Martin Pool (mbp) wrote :

It looks like you're using _worth_saving_limit as both a method and a variable on similar classes, which is at least potentially confusing if not actually buggy....

Revision history for this message
Martin Pool (mbp) wrote :

> It looks like you're using _worth_saving_limit as both a method and a variable
> on similar classes, which is at least potentially confusing if not actually
> buggy....

lifeless: poolie: the last one I read looked like it could cause confusion about the internal state; jam's comments were along those lines and I agreed.
lifeless: I don't know if it's been improved since then; and while it's a functional improvement it is a risky area of code.

Combined with my comments I think this may need to be resubmitted.

review: Needs Fixing
Revision history for this message
Martin Pool (mbp) wrote :

I'm going to say Reject to get it out of the queue; the idea is good but this needs updates.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

I was happy to take this out of the queue while we focussed on 2.0. Right now, though, I'm pretty confused as to why this has been rejected. Can someone do a proper review please and tell me exactly what's wrong?

To my understanding, I addressed the first round of comments. If I need to rename a variable or method, that's surely approve-after-tweak, not reject.

Revision history for this message
John A Meinel (jameinel) wrote :

Ian Clatworthy wrote:
> I was happy to take this out of the queue while we focussed on 2.0. Right now though, I'm pretty confused here as to why this is rejected. Can someone do a proper review please and tell me exactly what's wrong?
>
> To my understanding, I addressed the first round of comments. If I need to rename a variable or method, that's approve-after-tweak, not reject surely.
>

To clarify

I was cleaning up the queue a bit, where we had lots of stuff marked one
way or another in votes, but not in the actual merge status. The last
statement on this was "I'll Reject for now" from Martin, but he had not
actually set the status, so I did.

My understanding was that this was meant to be, please address some of
the issues and resubmit.

Unfortunately, there is no "Resubmit" object for a merge proposal.
Potentially we could have used "Work in Progress", which I think takes
it out of the queue.

John
=:->

Preview Diff

=== modified file 'NEWS'
--- NEWS	2009-08-30 23:51:10 +0000
+++ NEWS	2009-08-31 04:37:34 +0000
@@ -771,6 +771,7 @@
 Improvements
 ************
 
+<<<<<<< TREE
 * ``bzr annotate`` can now be significantly faster. The time for
   ``bzr annotate NEWS`` is down to 7s from 22s in 1.16. Files with long
   histories and lots of 'duplicate insertions' will be improved more than
@@ -786,6 +787,12 @@
 * Initial commit performance in ``--2a`` repositories has been improved by
   making it cheaper to build the initial CHKMap. (John Arbash Meinel)
 
+=======
+* ``bzr status [FILE]`` is now faster. On OpenOffice.org after changing a
+  file, the time improves from 1.6 to 1.1 seconds for full status and
+  1.0 to 0.5 seconds for status on the file. (Ian Clatworthy)
+
+>>>>>>> MERGE-SOURCE
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)
 
=== modified file 'bzrlib/_dirstate_helpers_pyx.pyx'
--- bzrlib/_dirstate_helpers_pyx.pyx	2009-08-28 05:00:33 +0000
+++ bzrlib/_dirstate_helpers_pyx.pyx	2009-08-31 04:37:34 +0000
@@ -911,7 +911,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified([entry])
         return link_or_sha1
 
 
=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py	2009-08-28 05:00:33 +0000
+++ bzrlib/dirstate.py	2009-08-31 04:37:34 +0000
@@ -363,11 +363,14 @@
     HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
     HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'
 
-    def __init__(self, path, sha1_provider):
+    def __init__(self, path, sha1_provider, worth_saving_limit=0):
         """Create a DirState object.
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed. -1 means never save.
         """
         # _header_state and _dirblock_state represent the current state
         # of the dirstate metadata and the per-row data respectiely.
@@ -410,11 +413,47 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # The set of known hash changes
+        self._known_hash_changes = set()
+        # How many hash changed entries can we have without saving
+        self._worth_saving_limit = worth_saving_limit
+        # If True, consider the worth saving limit when deciding whether to
+        # save the dirstate or not. If False, ignore it. If None, it can be
+        # set True but isn't True yet.
+        self._use_smart_saving = None
 
     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)
 
+    def _mark_modified(self, hash_changed_entries=None, header_too=False):
+        """Mark this dirstate as modified.
+
+        :param hash_changed_entries: if non-None, mark just these entries as
+            having their hash modified.
+        :param header_too: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified hash entries: %s", hash_changed_entries)
+        if hash_changed_entries:
+            self._known_hash_changes.update([e[0] for e in hash_changed_entries])
+            # We only enable smart saving is it hasn't already been disabled
+            if self._use_smart_saving is None:
+                self._use_smart_saving = True
+        else:
+            # We don't know exactly what changed so disable smart saving
+            self._use_smart_saving = False
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_too:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._use_smart_saving = None
+        self._known_hash_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.
 
@@ -546,7 +585,7 @@
         if kind == 'directory':
             # insert a new dirblock
             self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._id_index.setdefault(entry_key[2], set()).add(entry_key)
 
@@ -1018,8 +1057,7 @@
 
         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
 
     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1555,8 +1593,7 @@
             # the active tree.
             raise errors.InconsistentDeltaDelta(delta, "error from _get_entry.")
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = None
         return
 
@@ -1735,7 +1772,7 @@
                 and stat_value.st_ctime < self._cutoff_time):
                 entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
                                packed_stat)
-                self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+                self._mark_modified()
 
     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1799,14 +1836,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)
 
     def _get_ghosts_line(self, ghost_ids):
@@ -1817,6 +1853,10 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)
 
+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into a entry record.
 
@@ -2175,17 +2215,21 @@
         return len(self._parents) - len(self._ghosts)
 
     @staticmethod
-    def on_file(path, sha1_provider=None):
+    def on_file(path, sha1_provider=None, worth_saving_limit=0):
         """Construct a DirState on the file at path "path".
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
             If None, a DefaultSHA1Provider is used.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed. -1 means never save.
         :return: An unlocked DirState object, associated with the given path.
         """
         if sha1_provider is None:
             sha1_provider = DefaultSHA1Provider()
-        result = DirState(path, sha1_provider)
+        result = DirState(path, sha1_provider,
+                          worth_saving_limit=worth_saving_limit)
         return result
 
     def _read_dirblocks_if_needed(self):
@@ -2283,9 +2327,7 @@
             trace.mutter('Not saving DirState because '
                 '_changes_aborted is set.')
             return
-        if (self._header_state == DirState.IN_MEMORY_MODIFIED or
-            self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
-
+        if self._worth_saving():
             grabbed_write_lock = False
             if self._lock_state != 'w':
                 grabbed_write_lock, new_lock = self._lock_token.temporary_write_lock()
@@ -2299,12 +2341,12 @@
                     # We couldn't grab a write lock, so we switch back to a read one
                     return
             try:
+                lines = self.get_lines()
                 self._state_file.seek(0)
-                self._state_file.writelines(self.get_lines())
+                self._state_file.writelines(lines)
                 self._state_file.truncate()
                 self._state_file.flush()
-                self._header_state = DirState.IN_MEMORY_UNMODIFIED
-                self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+                self._mark_unmodified()
             finally:
                 if grabbed_write_lock:
                     self._lock_token = self._lock_token.restore_read_lock()
@@ -2313,6 +2355,25 @@
                     # not changed contents. Since restore_read_lock may
                     # not be an atomic operation.
 
+    def _worth_saving(self):
+        """Is it worth saving the dirstate or not?"""
+        if self._header_state == DirState.IN_MEMORY_MODIFIED:
+            return True
+        if (self._dirblock_state == DirState.IN_MEMORY_MODIFIED and
+            self._worth_saving_limit != -1):
+            # If we're using smart saving and only a small number of
+            # entries have changed their hash, don't bother saving. John has
+            # suggested using a heuristic here based on the size of the
+            # changed files and/or tree. For now, we go with a configurable
+            # number of changes, keeping the calculation time
+            # as low overhead as possible. (This also keeps all existing
+            # tests passing as the default is 0, i.e. always save.)
+            if self._use_smart_saving:
+                return len(self._known_hash_changes) > self._worth_saving_limit
+            else:
+                return True
+        return False
+
     def _set_data(self, parent_ids, dirblocks):
         """Set the full dirstate data in memory.
 
@@ -2326,8 +2387,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2353,7 +2413,7 @@
             self._make_absent(entry)
         self.update_minimal(('', '', new_id), 'd',
             path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
2357 if self._id_index is not None:2417 if self._id_index is not None:
2358 self._id_index.setdefault(new_id, set()).add(entry[0])2418 self._id_index.setdefault(new_id, set()).add(entry[0])
23592419
@@ -2493,8 +2553,7 @@
2493 self._entries_to_current_state(new_entries)2553 self._entries_to_current_state(new_entries)
2494 self._parents = [rev_id for rev_id, tree in trees]2554 self._parents = [rev_id for rev_id, tree in trees]
2495 self._ghosts = list(ghosts)2555 self._ghosts = list(ghosts)
2496 self._header_state = DirState.IN_MEMORY_MODIFIED2556 self._mark_modified(header_too=True)
2497 self._dirblock_state = DirState.IN_MEMORY_MODIFIED
2498 self._id_index = id_index2557 self._id_index = id_index
24992558
2500 def _sort_entries(self, entry_list):2559 def _sort_entries(self, entry_list):
@@ -2637,7 +2696,7 @@
2637 current_old[0][1].decode('utf8'))2696 current_old[0][1].decode('utf8'))
2638 self._make_absent(current_old)2697 self._make_absent(current_old)
2639 current_old = advance(old_iterator)2698 current_old = advance(old_iterator)
2640 self._dirblock_state = DirState.IN_MEMORY_MODIFIED2699 self._mark_modified()
2641 self._id_index = None2700 self._id_index = None
2642 self._packed_stat_index = None2701 self._packed_stat_index = None
2643 if tracing:2702 if tracing:
@@ -2692,7 +2751,7 @@
2692 if update_tree_details[0][0] == 'a': # absent2751 if update_tree_details[0][0] == 'a': # absent
2693 raise AssertionError('bad row %r' % (update_tree_details,))2752 raise AssertionError('bad row %r' % (update_tree_details,))
2694 update_tree_details[0] = DirState.NULL_PARENT_DETAILS2753 update_tree_details[0] = DirState.NULL_PARENT_DETAILS
2695 self._dirblock_state = DirState.IN_MEMORY_MODIFIED2754 self._mark_modified()
2696 return last_reference2755 return last_reference
26972756
2698 def update_minimal(self, key, minikind, executable=False, fingerprint='',2757 def update_minimal(self, key, minikind, executable=False, fingerprint='',
@@ -2857,7 +2916,7 @@
2857 if not present:2916 if not present:
2858 self._dirblocks.insert(block_index, (subdir_key[0], []))2917 self._dirblocks.insert(block_index, (subdir_key[0], []))
28592918
2860 self._dirblock_state = DirState.IN_MEMORY_MODIFIED2919 self._mark_modified()
28612920
2862 def _maybe_remove_row(self, block, index, id_index):2921 def _maybe_remove_row(self, block, index, id_index):
2863 """Remove index if it is absent or relocated across the row.2922 """Remove index if it is absent or relocated across the row.
@@ -3158,7 +3217,7 @@
3158 else:3217 else:
3159 entry[1][0] = ('l', '', stat_value.st_size,3218 entry[1][0] = ('l', '', stat_value.st_size,
3160 False, DirState.NULLSTAT)3219 False, DirState.NULLSTAT)
3161 state._dirblock_state = DirState.IN_MEMORY_MODIFIED3220 state._mark_modified([entry])
3162 return link_or_sha13221 return link_or_sha1
31633222
31643223
31653224
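The hunks above replace every manual `_header_state`/`_dirblock_state` assignment with calls to `_mark_modified()` and `_mark_unmodified()`, whose bodies fall outside the diffed region. Below is a guess at their shape, inferred only from the call sites visible here (`_mark_modified()`, `_mark_modified(header_too=True)`, `_mark_modified([entry])`); the attribute names mirror the patch, but the bodies are assumptions, not the patch's actual code:

```python
IN_MEMORY_UNMODIFIED = 'in-memory-unmodified'
IN_MEMORY_MODIFIED = 'in-memory-modified'


class DirStateFlags(object):
    """Sketch of the modification tracking this patch centralises."""

    def __init__(self):
        self._header_state = IN_MEMORY_UNMODIFIED
        self._dirblock_state = IN_MEMORY_UNMODIFIED
        self._known_hash_changes = set()  # entry keys whose hash changed
        self._use_smart_saving = True     # cleared on structural changes

    def _mark_modified(self, hash_changed_entries=None, header_too=False):
        """Record that in-memory state has diverged from the on-disk file.

        When only specific entries changed their hash, remember them so
        _worth_saving() can decide a save is not yet worth the cost.
        """
        self._dirblock_state = IN_MEMORY_MODIFIED
        if hash_changed_entries is not None:
            # Hash-only change: track exactly which entries moved.
            self._known_hash_changes.update(e[0] for e in hash_changed_entries)
        else:
            # Structural change (add/rename/delete): smart saving no
            # longer applies, so a save is always worthwhile.
            self._use_smart_saving = False
        if header_too:
            self._header_state = IN_MEMORY_MODIFIED

    def _mark_unmodified(self):
        """The on-disk file matches memory again (called after save())."""
        self._header_state = IN_MEMORY_UNMODIFIED
        self._dirblock_state = IN_MEMORY_UNMODIFIED
        self._known_hash_changes = set()
        self._use_smart_saving = True
```

In the real dirstate an `entry` is a row whose first element is the `(dirname, basename, file_id)` key, hence the `e[0]` above.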
=== modified file 'bzrlib/workingtree_4.py'
--- bzrlib/workingtree_4.py 2009-08-25 04:43:21 +0000
+++ bzrlib/workingtree_4.py 2009-08-31 04:37:35 +0000
@@ -219,7 +219,7 @@
            local_path = self.bzrdir.get_workingtree_transport(None
                ).local_abspath('dirstate')
            self._dirstate = dirstate.DirState.on_file(local_path,
-                self._sha1_provider())
+                self._sha1_provider(), self._worth_saving_limit())
        return self._dirstate

    def _sha1_provider(self):
@@ -234,6 +234,15 @@
        else:
            return None

+    def _worth_saving_limit(self):
+        """How many hash changes are ok before we must save the dirstate.
+
+        :return: an integer. -1 means never save.
+        """
+        # XXX: In the future, we could return -1 for logical read-only
+        # operations like status. For now, just use a small number.
+        return 10
+
    def filter_unversioned_files(self, paths):
        """Filter out paths that are versioned.

@@ -839,7 +848,7 @@
                rollback_rename()
                raise
            result.append((from_rel, to_rel))
-            state._dirblock_state = dirstate.DirState.IN_MEMORY_MODIFIED
+            state._mark_modified()
            self._make_dirty(reset_inventory=False)

        return result
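To see how the pieces interact, here is a standalone transcription of the `_worth_saving()` logic from the dirstate.py hunk, with the instance flags passed in as plain arguments (the function form and argument names are illustrative; the branching mirrors the patch):

```python
IN_MEMORY_UNMODIFIED = 'in-memory-unmodified'
IN_MEMORY_MODIFIED = 'in-memory-modified'


def worth_saving(header_state, dirblock_state, use_smart_saving,
                 known_hash_changes, worth_saving_limit):
    """Decide whether serialising the dirstate is worth the cost."""
    if header_state == IN_MEMORY_MODIFIED:
        return True
    if dirblock_state == IN_MEMORY_MODIFIED and worth_saving_limit != -1:
        if use_smart_saving:
            # Only hashes changed: save once enough have accumulated.
            return len(known_hash_changes) > worth_saving_limit
        return True
    return False


# With the limit of 10 that _worth_saving_limit() proposes, a few
# hash-only changes do not justify re-serialising a huge tree:
print(worth_saving(IN_MEMORY_UNMODIFIED, IN_MEMORY_MODIFIED, True,
                   {'a', 'b', 'c'}, 10))  # False
# A structural change (smart saving disabled) always saves:
print(worth_saving(IN_MEMORY_UNMODIFIED, IN_MEMORY_MODIFIED, False,
                   set(), 10))  # True
# A limit of -1 means never save for dirblock changes alone:
print(worth_saving(IN_MEMORY_UNMODIFIED, IN_MEMORY_MODIFIED, True,
                   {'a'}, -1))  # False
```

This makes the trade-off in the proposal concrete: serialisation stays O(size-of-tree), but for hash-only updates (e.g. `status` refreshing a stat fingerprint) the save is simply skipped until it pays for itself.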