Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy on 2009-05-28
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 392 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Reviewers:
  John A Meinel: Needs Information on 2009-06-05
  Andrew Bennetts (requested 2009-05-28): Approve on 2009-06-05
Review via email: mp+6861@code.launchpad.net

This proposal supersedes a proposal from 2009-05-28.

This proposal has been superseded by a proposal from 2009-06-17.

Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
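The "remember the serialised form, only re-serialise changed entries" idea can be illustrated with a standalone sketch. The `EntryCache` class and line format here are invented for illustration; the real dirstate code structures this differently:

```python
# Illustrative sketch: cache the serialised line for each entry and
# re-serialise only entries known to have changed. Not bzrlib code.

def serialise(key, value):
    """Serialise one entry to its on-disk line (toy format)."""
    return '%s\0%s\n' % (key, value)

class EntryCache(object):
    def __init__(self):
        self._lines = {}       # key -> previously serialised line
        self._changed = set()  # keys known to have changed

    def load(self, entries):
        # Remember the serialised form of every entry once.
        for key, value in entries.items():
            self._lines[key] = serialise(key, value)
        self._changed.clear()

    def mark_changed(self, key):
        self._changed.add(key)

    def get_lines(self, entries):
        # Serialisation is now O(number-of-changes) string work, not
        # O(size-of-tree): unchanged entries reuse their cached line.
        out = []
        for key in sorted(entries):
            if key in self._changed or key not in self._lines:
                out.append(serialise(key, entries[key]))
            else:
                out.append(self._lines[key])
        return out
```

With this shape, a status touching one file re-serialises one entry and reuses the rest, which is the effect the patch is after.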

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.

More broadly, I think the important thing is to begin recording the changes as this patch allows. So my current thought is that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)

Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)

Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.

We just need to agree on a number of changes below which it's not worth saving. I suggest we save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.

John A Meinel (jameinel) wrote : Posted in a previous version of this proposal


Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.

A guideline is: "Don't save if it would cost as much data as you avoid
reading".

The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.

If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.

John
=:->

Martin Pool (mbp) wrote : Posted in a previous version of this proposal

2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> who's sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.

It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logical
readonly operations.

--
Martin <http://launchpad.net/~mbp/>

Ian Clatworthy (ian-clatworthy) wrote :

This updated merge proposal makes it configurable as to how many changes are ok before the dirstate should be saved. As Martin requested, there's the ability to turn off saving altogether. After this lands, we can use that to implement various policies (e.g. don't save on a logical read-only operation) as we see fit.

This drops 'bzr st [file]' down by 0.5+ seconds on OOo. Nice. It doesn't yet make a difference for add or commit. Those are small follow-on patches after we're happy with this one.

Andrew Bennetts (spiv) wrote :

I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.

Overall, this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)

It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate, re-reading from the mmap rather than keeping a copy would clearly make sense.

+ :param worth_saving_limit: when the exact number of changed entries
+ is known, only bother saving the dirstate if more than this
+ count of entries have changed. -1 means never save.

I'd personally prefer "None means never save", but that's not a big deal.

I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.

I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.

+ # We only enable smart saving is it hasn't already been disabled

"is" -> "if"

+ if self._use_smart_saving is not False:
+ self._use_smart_saving = True

So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?

(If you really are just testing for True/False, it's generally a bad idea to test for True or False with "is". Use "if [not] self._use_smart_saving:" instead, which is also easier to read.)

review: Approve
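The tri-state flag Andrew is querying can be made explicit in isolation (a hypothetical helper, not bzrlib code): None means undecided, True means smart saving is enabled, False means it was explicitly disabled, and only the undecided state should flip to True:

```python
# Tri-state flag sketch for _use_smart_saving. Testing "is None" (the
# fix agreed later in the thread) enables smart saving only while the
# flag is still undecided; an explicit False must stay False.

def enable_smart_saving(current):
    """Return the new flag value when an exact change set is recorded."""
    if current is None:   # undecided: turn smart saving on
        return True
    return current        # True stays True, False stays disabled
```

By contrast, the `is not False` test in the diff assigns True when the flag is already True, which is harmless but obscures the intent.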
John A Meinel (jameinel) wrote :


Andrew Bennetts wrote:
> Review: Approve
> I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.
>
> Overall, I this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file.

I'm a bit surprised that re-reading the file is significantly faster
than converting a tuple of strings back into a serialized string. If it
is significant, then I'd certainly rather have a pyrex extension than
reading from disk...

>
> + :param worth_saving_limit: when the exact number of changed entries
> + is known, only bother saving the dirstate if more than this
> + count of entries have changed. -1 means never save.
>
> I'd personally prefer "None means never save", but that's not a big deal.
>
> I believe the dirstate can record changes to tree state, not just cache stat info...
> so is there a risk with this patch that e.g. "bzr mv foo bar" will go
unrecorded because it's under the _worth_saving_limit? I understand the
value in not updating the cache for only a small number of changes, but
AFAIK we always want to update the dirstate file when actual state is
changed. I think the header_too=True flag is used to make sure this
works, but that seems like a pretty indirect way of expressing the
concept that this is a non-optional update.

The *big* thing to watch out for is that there is "IN_MEMORY_MODIFIED"
that happens from a stat cache update, which can be trivially ignored.
And there is "IN_MEMORY_MODIFIED" that happens from "bzr add" which
*must* be written out.

I didn't look at the code to see if Ian clearly distinguished them, if
he has, then we are fine.

John
=:->

Ian Clatworthy (ian-clatworthy) wrote :

Andrew Bennetts wrote:
> Review: Approve
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.
>
>
Andrew,

Thanks for the review!

I did carefully benchmark a few approaches here and, as discussed on
IRC, not doing anything unless we absolutely have to gives the best
result for status. For other operations (that don't leverage the ability
to track some changes yet), the best time to read is probably just
before we mutate the dirstate. FWIW, profiling showed file reading time
to be a small factor - it was rebuilding the keys in the right order
that mattered. Before mutation, we simply grab them straight out of the
dirstate. After that, we need to rebuild them from the file content.
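The key-rebuilding step Ian describes can be shown with toy data. The null-separated line format here loosely mirrors the real dirstate serialisation (each line starts with a null and ends with a null plus newline); the field contents are invented:

```python
# Rebuilding entry keys (dir, name, file-id) from serialised dirstate
# lines, as _build_line_cache does after mutation. Toy data only.

lines = ['\0src\0foo.py\0id-1\0rest\0\n',
         '\0src\0bar.py\0id-2\0rest\0\n']
# Strip the leading null and the trailing null + newline; the
# serialisation step adds them back later.
values = [l[1:-2] for l in lines]
keys = []
for v in values:
    fields = v.split('\0', 3)       # only the first 3 fields form the key
    keys.append((fields[0], fields[1], fields[2]))
```

This is the "rebuild them from the file content" path; before mutation the keys can instead be lifted straight out of the in-memory dirblocks.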

> I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.
>
>
With this patch, there's no risk here. When follow-on patches arrive for
operations other than status, I'll need to handle those sorts of cases.

> I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.
>
>
I tried this during development. Some low level tests that "make a
change then read the disk content" fail. Otherwise, it looked ok to me.

> + # We only enable smart saving is it hasn't already been disabled
>
> "is" -> "if"
>
> + if self._use_smart_saving is not False:
> + self._use_smart_saving = True
>
> So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?
>
>
As discussed on IRC, I'll make this "if xxx is None".

Once again, thanks for the review.

Ian C.

John A Meinel (jameinel) wrote :

I'd just like to make it clear that I'm "hesitant" on this patch. I'll admit I haven't looked at it closely, but it seems a bit like a band-aid for now.

Especially given the complexity of keeping multiple inventories in sync...

+        if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
+            keys = []
+            for directory in self._dirblocks:
+                keys.extend([e[0] for e in directory[1]])
+        else:

What is the use case for calling _build_line_cache when the current state is IN_MEMORY_UNMODIFIED? Certainly save() is already defined as a no-op in that case. I'd rather avoid having extra code paths that aren't really being used.
I guess maybe it is if the header changes and the blocks don't...

I'll just mention that if you look over the code, the only time _header_state = ...MODIFIED is when _dirblock_state = ... MODIFIED

The only reason it has a separate state flag, is because you can read the header without reading the rest of the dirblocks. (At least to date, certainly the knob exists for that to not be true.)
Though if you consider what is in the header:

1) num parents / parent ids (if this changes, the dirblock state certainly should)
2) ghost parents (see 1)
3) crc (Also indicative of actual content changes)
4) num_entries (if this changes, we better have dirblock changes)

review: Needs Information
Robert Collins (lifeless) wrote :

re: safety.

Can I propose that we have an explicit API for just updating the hash
cache (we may already). And that the current state of MODIFIED *always*
refers to semantic changes rather than hash cache updates.

Then, the hash cache updating code should not alter the dirstate state.
Rather it should just accumulate into a 'hash cache changes counter'.
And save can check that.

This may be very similar to the code. The key thing is that I hope this
will structure things so as to avoid needing to be careful in commands.

-Rob
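Robert's proposal can be sketched as a small state machine (all names hypothetical; this is not bzrlib API): semantic changes always force a save, while hash-cache updates merely accumulate in a counter that save() weighs against a threshold:

```python
# Sketch of Robert's separation: a modified flag reserved for semantic
# changes, plus a hash-cache change counter. Illustrative names only.

class DirStateSketch:
    def __init__(self, hash_cache_limit=10):
        self.modified = False        # semantic changes only
        self.hash_cache_changes = 0  # stat/hash cache updates
        self.hash_cache_limit = hash_cache_limit
        self.saves = 0

    def record_semantic_change(self):
        # e.g. add, rename, delete: must reach disk
        self.modified = True

    def record_hash_cache_update(self):
        # e.g. a fresh sha1 for an otherwise unchanged file: safe to lose
        self.hash_cache_changes += 1

    def save(self):
        if self.modified or self.hash_cache_changes > self.hash_cache_limit:
            self.saves += 1
            self.modified = False
            self.hash_cache_changes = 0
```

The appeal is that commands no longer need to be careful: "bzr mv" can never be dropped by the threshold, because only cache-refresh traffic is ever weighed against it.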

4385. By Ian Clatworthy on 2009-06-17

merge bzr.dev r4449

4386. By Ian Clatworthy on 2009-06-17

apply review feedback

Unmerged revisions

4386. By Ian Clatworthy on 2009-06-17

apply review feedback

4385. By Ian Clatworthy on 2009-06-17

merge bzr.dev r4449

4384. By Ian Clatworthy on 2009-05-28

support a configurable limit of changes before dirstate saving

4383. By Ian Clatworthy on 2009-05-28

unroll smart saving logic into one loop in one routine

4382. By Ian Clatworthy on 2009-05-28

use a tuple of lists instead of a dict for the line cache

4381. By Ian Clatworthy on 2009-05-28

speed up dirstate serialisation, making status faster

4380. By Ian Clatworthy on 2009-05-27

prepare for smart dirstate saving

Preview Diff

=== modified file 'bzrlib/_dirstate_helpers_c.pyx'
--- bzrlib/_dirstate_helpers_c.pyx	2009-06-09 01:52:10 +0000
+++ bzrlib/_dirstate_helpers_c.pyx	2009-06-16 02:37:03 +0000
@@ -909,7 +909,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified([entry])
     return link_or_sha1


=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py	2009-06-01 08:55:17 +0000
+++ bzrlib/dirstate.py	2009-06-16 02:37:03 +0000
@@ -362,11 +362,14 @@
     HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
     HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'

-    def __init__(self, path, sha1_provider):
+    def __init__(self, path, sha1_provider, worth_saving_limit=0):
         """Create a DirState object.

         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
+        :param worth_saving_limit: when the exact number of changed entries
+            is known, only bother saving the dirstate if more than this
+            count of entries have changed. -1 means never save.
         """
         # _header_state and _dirblock_state represent the current state
         # of the dirstate metadata and the per-row data respectiely.
@@ -409,11 +412,48 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # If True, use the per-entry field cache for faster serialisation.
+        # If False, disable it. If None, it is not used but may be enabled.
+        self._use_smart_saving = None
+        # The set of known changes
+        self._known_changes = set()
+        # The cache of serialised lines. When built, this is a tuple of
+        # 2 sorted lists that we "walk" while serialising.
+        self._line_cache = None
+        # How many changed entries can we have without saving
+        self._worth_saving_limit = worth_saving_limit

     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)

+    def _mark_modified(self, entries=None, header_too=False):
+        """Mark this dirstate as modified.
+
+        :param entries: if non-None, mark just these entries as modified.
+        :param header_too: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified entries: %s", entries)
+        if entries:
+            self._known_changes.update([e[0] for e in entries])
+            # We only enable smart saving is it hasn't already been disabled
+            if self._use_smart_saving is not False:
+                self._use_smart_saving = True
+        else:
+            # We don't know exactly what changed so disable smart saving
+            self._use_smart_saving = False
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_too:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._use_smart_saving = None
+        self._known_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.

@@ -545,7 +585,7 @@
         if kind == 'directory':
             # insert a new dirblock
             self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._id_index.setdefault(entry_key[2], set()).add(entry_key)

@@ -1017,8 +1057,7 @@

         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)

     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1460,8 +1499,7 @@
             # Apply in-situ changes.
             self._update_basis_apply_changes(changes)

-            self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-            self._header_state = DirState.IN_MEMORY_MODIFIED
+            self._mark_modified(header_too=True)
             self._id_index = None
             return

@@ -1594,7 +1632,7 @@
                 and stat_value.st_ctime < self._cutoff_time):
                 entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
                                packed_stat)
-                self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+                self._mark_modified()

     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1658,14 +1696,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)

     def _get_ghosts_line(self, ghost_ids):
@@ -1676,6 +1713,37 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)

+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        if self._use_smart_saving:
+            # Build the line cache if it hasn't already been built
+            self._build_line_cache()
+            # We unroll this case for better performance ...
+            # The line cache is a tuple of 2 ordered lists: keys and lines.
+            # We keep track of successful matches and only search from there
+            # on next time.
+            entry_to_line = self._entry_to_line
+            known_changes = self._known_changes
+            index = 0
+            keys, serialised = self._line_cache
+            result = []
+            for entry in self._iter_entries():
+                key = entry[0]
+                if key in known_changes:
+                    result.append(entry_to_line(entry))
+                else:
+                    if keys[index] != key:
+                        try:
+                            index = keys.index(key, index + 1)
+                        except ValueError:
+                            result.append(entry_to_line(entry))
+                            continue
+                    result.append(serialised[index])
+                    index += 1
+            return result
+        else:
+            return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into a entry record.

@@ -2034,17 +2102,21 @@
         return len(self._parents) - len(self._ghosts)

     @staticmethod
-    def on_file(path, sha1_provider=None):
+    def on_file(path, sha1_provider=None, worth_saving_limit=0):
         """Construct a DirState on the file at path "path".

         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
             If None, a DefaultSHA1Provider is used.
+        :param worth_saving_limit: when the exact number of changed entries
+            is known, only bother saving the dirstate if more than this
+            count of entries have changed. -1 means never save.
         :return: An unlocked DirState object, associated with the given path.
         """
         if sha1_provider is None:
             sha1_provider = DefaultSHA1Provider()
-        result = DirState(path, sha1_provider)
+        result = DirState(path, sha1_provider,
+                          worth_saving_limit=worth_saving_limit)
         return result

@@ -2058,6 +2130,36 @@
         if self._dirblock_state == DirState.NOT_IN_MEMORY:
             _read_dirblocks(self)

+    def _build_line_cache(self):
+        """Build the line cache.
+
+        The line cache maps entry keys to serialised lines via
+        a tuple of 2 sorted lists.
+        """
+        if self._line_cache is not None:
+            # already built
+            return
+        self._state_file.seek(0)
+        lines = self._state_file.readlines()
+        # There are 5 header lines: 3 in the prelude, a line for
+        # parents and a line for ghosts. There is also a trailing
+        # empty line. We skip over those.
+        # Each line starts with a null and ends with a null and
+        # newline. We don't keep those because the serialisation
+        # process adds them.
+        values = [l[1:-2] for l in lines[5:-1]]
+        if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
+            keys = []
+            for directory in self._dirblocks:
+                keys.extend([e[0] for e in directory[1]])
+        else:
+            # Be safe and calculate the keys from the lines
+            keys = []
+            for v in values:
+                fields = v.split('\0', 3)
+                keys.append((fields[0], fields[1], fields[2]))
+        self._line_cache = (keys, values)
+
     def _read_header(self):
         """This reads in the metadata header, and the parent ids.

@@ -2142,9 +2244,7 @@
             trace.mutter('Not saving DirState because '
                          '_changes_aborted is set.')
             return
-        if (self._header_state == DirState.IN_MEMORY_MODIFIED or
-            self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
-
+        if self._worth_saving():
             grabbed_write_lock = False
             if self._lock_state != 'w':
                 grabbed_write_lock, new_lock = self._lock_token.temporary_write_lock()
@@ -2158,12 +2258,12 @@
                     # We couldn't grab a write lock, so we switch back to a read one
                     return
             try:
+                lines = self.get_lines()
                 self._state_file.seek(0)
-                self._state_file.writelines(self.get_lines())
+                self._state_file.writelines(lines)
                 self._state_file.truncate()
                 self._state_file.flush()
-                self._header_state = DirState.IN_MEMORY_UNMODIFIED
-                self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+                self._mark_unmodified()
             finally:
                 if grabbed_write_lock:
                     self._lock_token = self._lock_token.restore_read_lock()
@@ -2172,6 +2272,26 @@
                     # not changed contents. Since restore_read_lock may
                     # not be an atomic operation.

+    def _worth_saving(self):
+        """Is it worth saving the dirstate or not?"""
+        # Header changes are probably important enough to always save
+        if self._header_state == DirState.IN_MEMORY_MODIFIED:
+            return True
+        if (self._dirblock_state == DirState.IN_MEMORY_MODIFIED and
+            self._worth_saving_limit != -1):
+            # If we're using smart saving and only a small number of
+            # entries have changed, don't bother saving. John has
+            # suggested using a heuristic here based on the size of the
+            # changed files and/or tree. For now, we go with a configurable
+            # number of changes, keeping the calculation time
+            # as low overhead as possible. (This also keeps all existing
+            # tests passing as the default is 0, i.e. always save.)
+            if self._use_smart_saving:
+                return len(self._known_changes) > self._worth_saving_limit
+            else:
+                return True
+        return False
+
     def _set_data(self, parent_ids, dirblocks):
         """Set the full dirstate data in memory.

@@ -2185,8 +2305,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2212,7 +2331,7 @@
             self._make_absent(entry)
         self.update_minimal(('', '', new_id), 'd',
             path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index is not None:
             self._id_index.setdefault(new_id, set()).add(entry[0])

@@ -2352,8 +2471,7 @@
         self._entries_to_current_state(new_entries)
         self._parents = [rev_id for rev_id, tree in trees]
         self._ghosts = list(ghosts)
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = id_index

     def _sort_entries(self, entry_list):
@@ -2471,7 +2589,7 @@
                 # without seeing it in the new list. so it must be gone.
                 self._make_absent(current_old)
                 current_old = advance(old_iterator)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         self._id_index = None
         self._packed_stat_index = None

@@ -2524,7 +2642,7 @@
                 if update_tree_details[0][0] == 'a': # absent
                     raise AssertionError('bad row %r' % (update_tree_details,))
                 update_tree_details[0] = DirState.NULL_PARENT_DETAILS
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         return last_reference

     def update_minimal(self, key, minikind, executable=False, fingerprint='',
@@ -2650,7 +2768,7 @@
             if not present:
                 self._dirblocks.insert(block_index, (subdir_key[0], []))

-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()

     def _validate(self):
         """Check that invariants on the dirblock are correct.
@@ -2936,7 +3054,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        state._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        state._mark_modified([entry])
     return link_or_sha1
 update_entry = py_update_entry

=== modified file 'bzrlib/workingtree_4.py'
--- bzrlib/workingtree_4.py	2009-06-11 04:23:53 +0000
+++ bzrlib/workingtree_4.py	2009-06-16 02:37:03 +0000
@@ -248,7 +248,7 @@
         local_path = self.bzrdir.get_workingtree_transport(None
             ).local_abspath('dirstate')
         self._dirstate = dirstate.DirState.on_file(local_path,
-            self._sha1_provider())
+            self._sha1_provider(), self._worth_saving_limit())
         return self._dirstate

     def _sha1_provider(self):
@@ -263,6 +263,15 @@
         else:
             return None

+    def _worth_saving_limit(self):
+        """How many changes are ok before we must save the dirstate.
+
+        :return: an integer. -1 means never save.
+        """
+        # XXX: In the future, we could return -1 for logical read-only
+        # operations like status. For now, just use a small number.
+        return 10
+
     def filter_unversioned_files(self, paths):
         """Filter out paths that are versioned.

@@ -863,7 +872,7 @@
             rollback_rename()
             raise
         result.append((from_rel, to_rel))
-        state._dirblock_state = dirstate.DirState.IN_MEMORY_MODIFIED
+        state._mark_modified()
         self._make_dirty(reset_inventory=False)

         return result