Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old
Status: Rejected
Rejected by: John A Meinel
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 351 lines (has conflicts); text conflict in NEWS
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Related bugs:
Reviewer: Martin Pool (status: Needs Fixing)
Review via email: mp+7537@code.launchpad.net
This proposal supersedes a proposal from 2009-05-28.
Ian Clatworthy (ian-clatworthy) wrote (posted in a previous version of this proposal):
Andrew Bennetts (spiv) wrote (posted in a previous version of this proposal):
Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)
Ian Clatworthy (ian-clatworthy) wrote (posted in a previous version of this proposal):
> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)
Poolie has agreed with this, so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
We just need to agree on the number of changes below which it's not worth saving. I suggest we save for 3 or more changes, i.e. skip the save iff 2 or fewer files have changed. If someone has a better number, let me know.
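Ian's suggested rule can be written as a tiny predicate (a sketch with illustrative names, not bzrlib code):

```python
def worth_saving(num_changed, limit=2):
    # Skip the save iff `num_changed` is at or below the limit;
    # with limit=2 we still save for 3 or more changed files.
    return num_changed > limit

assert not worth_saving(2)   # 2 or fewer changes: not worth saving
assert worth_saving(3)       # 3 or more changes: save the dirstate
```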
John A Meinel (jameinel) wrote (posted in a previous version of this proposal):
Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.
A guideline is: "Don't save if it would cost as much data as you avoid
reading".
The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.
If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.
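John's guideline amounts to comparing the bytes written against the bytes of re-reading avoided; a rough sketch (the sizes and names are assumptions for illustration, not bzrlib code):

```python
def should_save(dirstate_bytes, changed_file_bytes):
    # "Don't save if it would cost as much data as you avoid reading":
    # rewrite the dirstate only once the re-reads it spares would cost
    # at least as many bytes as the rewrite itself.
    return sum(changed_file_bytes) >= dirstate_bytes

# 1MB dirstate, ten 10k changed files: only 100k of re-reads avoided,
# so skipping the save is cheaper than rewriting 1MB.
assert not should_save(1_000_000, [10_000] * 10)
assert should_save(1_000_000, [10_000] * 100)
```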
John
=:->
Martin Pool (mbp) wrote (posted in a previous version of this proposal):
2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> whose sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.
It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logical
readonly operations.
--
Martin <http://
Ian Clatworthy (ian-clatworthy) wrote (posted in a previous version of this proposal):
This updated merge proposal makes it configurable how many changes are acceptable before the dirstate is saved. As Martin requested, there's the ability to turn off saving altogether. After this lands, we can use that to implement various policies (e.g. don't save on a logical read-only operation) as we see fit.
This drops 'bzr st [file]' down by 0.5+ seconds on OOo. Nice. It doesn't yet make a difference for add or commit. Those are small follow-on patches after we're happy with this one.
Andrew Bennetts (spiv) wrote (posted in a previous version of this proposal):
I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.
Overall, this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)
It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.
+ :param worth_saving_limit: when the exact number of changed entries
+ is known, only bother saving the dirstate if more than this
+ count of entries have changed. -1 means never save.
I'd personally prefer "None means never save", but that's not a big deal.
I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_
I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.
+ # We only enable smart saving is it hasn't already been disabled
"is" -> "if"
+ if self._use_
+ self._use_
So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?
(If you really are just testing for True/False, it's generally a bad idea to test for True or False with "is". Use "if [not] self._use_
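The distinction Andrew is drawing can be seen in a couple of lines: `is` tests identity, so `x is True` fails for truthy non-bool values, while an `is None` check cleanly implements the tri-state (unset / enabled / disabled) the patch wants. A minimal standalone illustration:

```python
# Truthy values that are not the True singleton fail an identity test.
flag = 1
assert flag and flag is not True

# The sentinel pattern: only flip to True if nothing has disabled it yet.
use_smart_saving = None          # tri-state: None = not yet decided
if use_smart_saving is None:
    use_smart_saving = True
assert use_smart_saving is True

use_smart_saving = False         # explicitly disabled elsewhere
if use_smart_saving is None:     # stays disabled: this branch is skipped
    use_smart_saving = True
assert use_smart_saving is False
```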
John A Meinel (jameinel) wrote (posted in a previous version of this proposal):
Andrew Bennetts wrote:
> Review: Approve
> I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.
>
> Overall, this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file.
I'm a bit surprised that re-reading the file is significantly faster
than converting a tuple of strings back into a serialized string. If it
is significant, then I'd certainly rather have a pyrex extension than reading
from disk...
>
> + :param worth_saving_limit: when the exact number of changed entries
> + is known, only bother saving the dirstate if more than this
> + count of entries have changed. -1 means never save.
>
> I'd personally prefer "None means never save", but that's not a big deal.
>
> I believe the dirstate can record changes to tree state, not just cache stat info...
> so is there a risk with this patch that e.g. "bzr mv foo bar" will go
> unrecorded because it's under the _worth_
value in not updating the cache for only a small number of changes, but
AFAIK we always want to update the dirstate file when actual state is
changed. I think the header_too=True flag is used to make sure this
works, but that seems like a pretty indirect way of expressing the
concept that this is a non-optional update.
The *big* thing to watch out for is that there is "IN_MEMORY_
that happens from a stat cache update, which can be trivially ignored.
And there is "IN_MEMORY_
*must* be written out.
I didn't look at the code to see if Ian clearly distinguished them; if
he has, then we are fine.
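The risk John describes is mechanical: if a single "modified" flag feeds the skip heuristic, a semantic change such as a rename looks identical to a single stat-cache refresh and could be skipped. A contrived sketch of the broken shape (illustrative names, not bzrlib code):

```python
def save_needed_single_flag(modified, num_changes, limit=2):
    # One boolean cannot distinguish "stat cache refreshed" from
    # "tree state changed", so the skip heuristic applies to both.
    return modified and num_changes > limit

# A lone rename modifies only one entry; the heuristic skips the save
# and the rename would be lost -- hence the need for two distinct
# modification states (cache-only vs. must-be-written).
assert not save_needed_single_flag(modified=True, num_changes=1)
assert save_needed_single_flag(modified=True, num_changes=3)
```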
John
=:->
Ian Clatworthy (ian-clatworthy) wrote (posted in a previous version of this proposal):
Andrew Bennetts wrote:
> Review: Approve
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.
>
>
Andrew,
Thanks for the review!
I did carefully benchmark a few approaches here and, as discussed on
IRC, not doing anything unless we absolutely have to gives the best
result for status. For other operations (that don't leverage the ability
to track some changes yet), the best time to read is probably just
before we mutate the dirstate. FWIW, profiling showed file reading time
to be a small factor - it was rebuilding the keys in the right order
that mattered. Before mutation, we simply grab them straight out of the
dirstate. After that, we need to rebuild them from the file content.
> I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_
>
>
With this patch, there's no risk here. When follow-on patches arrive for
operations other than status, I'll need to handle those sorts of cases.
> I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.
>
>
I tried this during development. Some low level tests fail that "make
change then read disk content". Otherwise, it looked ok to me.
> + # We only enable smart saving is it hasn't already been disabled
>
> "is" -> "if"
>
> + if self._use_
> + self._use_
>
> So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?
>
>
As discussed on IRC, I'll make this "if xxx is None".
Once again, thanks for the review.
Ian C.
John A Meinel (jameinel) wrote (posted in a previous version of this proposal):
I'd just like to make it clear that I'm "hesitant" on this patch. I'll admit I haven't looked at it closely, but it seems a bit like a band-aid-for-now sort of thing.
Especially given the complexity of keeping multiple inventories in sync...
221 + if self._dirblock_
222 + keys = []
223 + for directory in self._dirblocks:
224 + keys.extend([e[0] for e in directory[1]])
225 + else:
What is the use case for calling _build_line_cache when the current state is IN_MEMORY_
I guess maybe it is if the header changes and the blocks don't...
I'll just mention that if you look over the code, the only time _header_state = ...MODIFIED is when _dirblock_state = ... MODIFIED
The only reason it has a separate state flag, is because you can read the header without reading the rest of the dirblocks. (At least to date, certainly the knob exists for that to not be true.)
Though if you consider what is in the header:
1) num parents / parent ids (if this changes, the dirblock state certainly should)
2) ghost parents (see 1)
3) crc (Also indicative of actual content changes)
4) num_entries (if this changes, we better have dirblock changes)
Robert Collins (lifeless) wrote (posted in a previous version of this proposal):
re: safety.
Can I propose that we have an explicit API for just updating the hash
cache (we may already). And that the current state of MODIFIED *always*
refer to having semantic changes rather than hash cache updates.
Then, the hash cache updating code should not alter the dirstate state.
Rather it should just accumulate into a 'hash cache changes counter'.
And save can check that.
This may be very similar to the code. The key thing is that I hope this
will structure things so as to avoid needing to be careful in commands.
-Rob
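Robert's proposal could be structured roughly like this (a sketch under assumed names; the real state lives on bzrlib's DirState class):

```python
class DirStateSketch:
    """MODIFIED always means semantic changes; hash-cache updates only
    bump a counter and never touch the modified flag."""

    def __init__(self, worth_saving_limit=2):
        self._modified = False        # semantic tree changes only
        self._hash_cache_changes = 0  # accumulated cache-only updates
        self._limit = worth_saving_limit

    def note_hash_cache_update(self):
        self._hash_cache_changes += 1

    def note_semantic_change(self):
        self._modified = True

    def save_needed(self):
        # Semantic changes are always written; cache-only updates are
        # written only once there are enough of them to pay for the IO.
        if self._modified:
            return True
        return self._hash_cache_changes > self._limit

s = DirStateSketch()
s.note_hash_cache_update()
assert not s.save_needed()      # one cache update: skip the write
s.note_semantic_change()
assert s.save_needed()          # e.g. a rename: must hit disk
```

With this split, commands never need to be careful: save() itself can tell a droppable cache refresh from a mandatory write.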
Ian Clatworthy (ian-clatworthy) wrote:
Thanks to Andrew, John and Robert for their feedback on my earlier merge proposal. This updated one takes their comments into account, particularly the bit about optional vs mandatory (hash change vs semantically important) updating. It also drops the "line cache" as there are better ways (e.g. pyrex) of solving serialisation speed (which we can look into later).
As before, this improves 'bzr status' performance on OpenOffice.org by 0.5 seconds after changing a small number of files. As mentioned in NEWS, full status drops from 1.6 to 1.1 seconds while status on just a FILE drops from 1.0 to 0.5 seconds.
Martin Pool (mbp) wrote:
It looks like you're using _worth_saving_limit as both a method and a variable on similar classes, which is at least potentially confusing if not actually buggy....
Martin Pool (mbp) wrote:
> It looks like you're using _worth_saving_limit as both a method and a variable
> on similar classes, which is at least potentially confusing if not actually
> buggy....
lifeless: poolie: the last one I read looked like it could cause confusion about the internal state; jam's comments were along those lines and I agreed.
lifeless: I don't know if it's been improved since then; and while it's a functional improvement, it is a risky area of code.
Combined with my comments I think this may need to be resubmitted.
Martin Pool (mbp) wrote:
I'm going to say Reject to get it out of the queue; the idea is good but this needs updates.
Ian Clatworthy (ian-clatworthy) wrote:
I was happy to take this out of the queue while we focussed on 2.0. Right now though, I'm pretty confused here as to why this is rejected. Can someone do a proper review please and tell me exactly what's wrong?
To my understanding, I addressed the first round of comments. If I need to rename a variable or method, that's approve-
John A Meinel (jameinel) wrote:
Ian Clatworthy wrote:
> I was happy to take this out of the queue while we focussed on 2.0. Right now though, I'm pretty confused here as to why this is rejected. Can someone do a proper review please and tell me exactly what's wrong?
>
> To my understanding, I addressed the first round of comments. If I need to rename a variable or method, that's approve-
>
To clarify
I was cleaning up the queue a bit, where we had lots of stuff marked one
way or another in votes, but not in the actual merge status. The last
statement on this was "I'll Reject for now" from Martin, but he had not
actually set the status, so I did.
My understanding was that this was meant to be, please address some of
the issues and resubmit.
Unfortunately, there is no "Resubmit" object for a merge proposal.
Potentially we could have used "Work in Progress", which I think takes
it out of the queue.
John
=:->
Preview Diff
=== modified file 'NEWS'
--- NEWS	2009-08-30 23:51:10 +0000
+++ NEWS	2009-08-31 04:37:34 +0000
@@ -771,6 +771,7 @@
 Improvements
 ************
 
+<<<<<<< TREE
 * ``bzr annotate`` can now be significantly faster. The time for
   ``bzr annotate NEWS`` is down to 7s from 22s in 1.16. Files with long
   histories and lots of 'duplicate insertions' will be improved more than
@@ -786,6 +787,12 @@
 * Initial commit performance in ``--2a`` repositories has been improved by
   making it cheaper to build the initial CHKMap. (John Arbash Meinel)
 
+=======
+* ``bzr status [FILE]`` is now faster. On OpenOffice.org after changing a
+  file, the time improves from 1.6 to 1.1 seconds for full status and
+  1.0 to 0.5 seconds for status on the file. (Ian Clatworthy)
+
+>>>>>>> MERGE-SOURCE
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)
 

=== modified file 'bzrlib/_dirstate_helpers_pyx.pyx'
--- bzrlib/_dirstate_helpers_pyx.pyx	2009-08-28 05:00:33 +0000
+++ bzrlib/_dirstate_helpers_pyx.pyx	2009-08-31 04:37:34 +0000
@@ -911,7 +911,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified([entry])
     return link_or_sha1
 
 

=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py	2009-08-28 05:00:33 +0000
+++ bzrlib/dirstate.py	2009-08-31 04:37:34 +0000
@@ -363,11 +363,14 @@
     HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
     HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'
 
-    def __init__(self, path, sha1_provider):
+    def __init__(self, path, sha1_provider, worth_saving_limit=0):
         """Create a DirState object.
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed. -1 means never save.
         """
         # _header_state and _dirblock_state represent the current state
         # of the dirstate metadata and the per-row data respectiely.
@@ -410,11 +413,47 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # The set of known hash changes
+        self._known_hash_changes = set()
+        # How many hash changed entries can we have without saving
+        self._worth_saving_limit = worth_saving_limit
+        # If True, consider the worth saving limit when deciding whether to
+        # save the dirstate or not. If False, ignore it. If None, it can be
+        # set True but isn't True yet.
+        self._use_smart_saving = None
 
     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)
 
+    def _mark_modified(self, hash_changed_entries=None, header_too=False):
+        """Mark this dirstate as modified.
+
+        :param hash_changed_entries: if non-None, mark just these entries as
+            having their hash modified.
+        :param header_too: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified hash entries: %s", hash_changed_entries)
+        if hash_changed_entries:
+            self._known_hash_changes.update([e[0] for e in hash_changed_entries])
+            # We only enable smart saving is it hasn't already been disabled
+            if self._use_smart_saving is None:
+                self._use_smart_saving = True
+        else:
+            # We don't know exactly what changed so disable smart saving
+            self._use_smart_saving = False
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_too:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._use_smart_saving = None
+        self._known_hash_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.
 
@@ -546,7 +585,7 @@
         if kind == 'directory':
             # insert a new dirblock
             self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._id_index.setdefault(entry_key[2], set()).add(entry_key)
 
@@ -1018,8 +1057,7 @@
 
         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
 
     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1555,8 +1593,7 @@
             # the active tree.
             raise errors.InconsistentDeltaDelta(delta, "error from _get_entry.")
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = None
         return
 
@@ -1735,7 +1772,7 @@
                 and stat_value.st_ctime < self._cutoff_time):
                 entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
                                packed_stat)
-                self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+                self._mark_modified()
 
     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1799,14 +1836,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)
 
     def _get_ghosts_line(self, ghost_ids):
@@ -1817,6 +1853,10 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)
 
+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into a entry record.
 
@@ -2175,17 +2215,21 @@
         return len(self._parents) - len(self._ghosts)
 
     @staticmethod
-    def on_file(path, sha1_provider=None):
+    def on_file(path, sha1_provider=None, worth_saving_limit=0):
         """Construct a DirState on the file at path "path".
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
            If None, a DefaultSHA1Provider is used.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed. -1 means never save.
         :return: An unlocked DirState object, associated with the given path.
         """
         if sha1_provider is None:
             sha1_provider = DefaultSHA1Provider()
-        result = DirState(path, sha1_provider)
+        result = DirState(path, sha1_provider,
+                          worth_saving_limit=worth_saving_limit)
         return result
 
     def _read_dirblocks_if_needed(self):
@@ -2283,9 +2327,7 @@
             trace.mutter('Not saving DirState because '
                          '_changes_aborted is set.')
             return
-        if (self._header_state == DirState.IN_MEMORY_MODIFIED or
-            self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
-
+        if self._worth_saving():
             grabbed_write_lock = False
             if self._lock_state != 'w':
                 grabbed_write_lock, new_lock = self._lock_token.temporary_write_lock()
@@ -2299,12 +2341,12 @@
                     # We couldn't grab a write lock, so we switch back to a read one
                     return
             try:
+                lines = self.get_lines()
                 self._state_file.seek(0)
-                self._state_file.writelines(self.get_lines())
+                self._state_file.writelines(lines)
                 self._state_file.truncate()
                 self._state_file.flush()
-                self._header_state = DirState.IN_MEMORY_UNMODIFIED
-                self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+                self._mark_unmodified()
             finally:
                 if grabbed_write_lock:
                     self._lock_token = self._lock_token.restore_read_lock()
@@ -2313,6 +2355,25 @@
                     # not changed contents. Since restore_read_lock may
                     # not be an atomic operation.
 
+    def _worth_saving(self):
+        """Is it worth saving the dirstate or not?"""
+        if self._header_state == DirState.IN_MEMORY_MODIFIED:
+            return True
+        if (self._dirblock_state == DirState.IN_MEMORY_MODIFIED and
+            self._worth_saving_limit != -1):
+            # If we're using smart saving and only a small number of
+            # entries have changed their hash, don't bother saving. John has
+            # suggested using a heuristic here based on the size of the
+            # changed files and/or tree. For now, we go with a configurable
+            # number of changes, keeping the calculation time
+            # as low overhead as possible. (This also keeps all existing
+            # tests passing as the default is 0, i.e. always save.)
+            if self._use_smart_saving:
+                return len(self._known_hash_changes) > self._worth_saving_limit
+            else:
+                return True
+        return False
+
     def _set_data(self, parent_ids, dirblocks):
         """Set the full dirstate data in memory.
 
@@ -2326,8 +2387,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2353,7 +2413,7 @@
             self._make_absent(entry)
             self.update_minimal(('', '', new_id), 'd',
                 path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index is not None:
             self._id_index.setdefault(new_id, set()).add(entry[0])
 
@@ -2493,8 +2553,7 @@
         self._entries_to_current_state(new_entries)
         self._parents = [rev_id for rev_id, tree in trees]
         self._ghosts = list(ghosts)
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = id_index
 
     def _sort_entries(self, entry_list):
278 | @@ -2637,7 +2696,7 @@ | |||
279 | 2637 | current_old[0][1].decode('utf8')) | 2696 | current_old[0][1].decode('utf8')) |
280 | 2638 | self._make_absent(current_old) | 2697 | self._make_absent(current_old) |
281 | 2639 | current_old = advance(old_iterator) | 2698 | current_old = advance(old_iterator) |
283 | 2640 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 2699 | self._mark_modified() |
284 | 2641 | self._id_index = None | 2700 | self._id_index = None |
285 | 2642 | self._packed_stat_index = None | 2701 | self._packed_stat_index = None |
286 | 2643 | if tracing: | 2702 | if tracing: |
287 | @@ -2692,7 +2751,7 @@ | |||
288 | 2692 | if update_tree_details[0][0] == 'a': # absent | 2751 | if update_tree_details[0][0] == 'a': # absent |
289 | 2693 | raise AssertionError('bad row %r' % (update_tree_details,)) | 2752 | raise AssertionError('bad row %r' % (update_tree_details,)) |
290 | 2694 | update_tree_details[0] = DirState.NULL_PARENT_DETAILS | 2753 | update_tree_details[0] = DirState.NULL_PARENT_DETAILS |
292 | 2695 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 2754 | self._mark_modified() |
293 | 2696 | return last_reference | 2755 | return last_reference |
294 | 2697 | 2756 | ||
295 | 2698 | def update_minimal(self, key, minikind, executable=False, fingerprint='', | 2757 | def update_minimal(self, key, minikind, executable=False, fingerprint='', |
296 | @@ -2857,7 +2916,7 @@ | |||
297 | 2857 | if not present: | 2916 | if not present: |
298 | 2858 | self._dirblocks.insert(block_index, (subdir_key[0], [])) | 2917 | self._dirblocks.insert(block_index, (subdir_key[0], [])) |
299 | 2859 | 2918 | ||
301 | 2860 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 2919 | self._mark_modified() |
302 | 2861 | 2920 | ||
303 | 2862 | def _maybe_remove_row(self, block, index, id_index): | 2921 | def _maybe_remove_row(self, block, index, id_index): |
304 | 2863 | """Remove index if it is absent or relocated across the row. | 2922 | """Remove index if it is absent or relocated across the row. |
305 | @@ -3158,7 +3217,7 @@ | |||
306 | 3158 | else: | 3217 | else: |
307 | 3159 | entry[1][0] = ('l', '', stat_value.st_size, | 3218 | entry[1][0] = ('l', '', stat_value.st_size, |
308 | 3160 | False, DirState.NULLSTAT) | 3219 | False, DirState.NULLSTAT) |
310 | 3161 | state._dirblock_state = DirState.IN_MEMORY_MODIFIED | 3220 | state._mark_modified([entry]) |
311 | 3162 | return link_or_sha1 | 3221 | return link_or_sha1 |
312 | 3163 | 3222 | ||
313 | 3164 | 3223 | ||
314 | 3165 | 3224 | ||
315 | === modified file 'bzrlib/workingtree_4.py' | |||
316 | --- bzrlib/workingtree_4.py 2009-08-25 04:43:21 +0000 | |||
317 | +++ bzrlib/workingtree_4.py 2009-08-31 04:37:35 +0000 | |||
318 | @@ -219,7 +219,7 @@ | |||
319 | 219 | local_path = self.bzrdir.get_workingtree_transport(None | 219 | local_path = self.bzrdir.get_workingtree_transport(None |
320 | 220 | ).local_abspath('dirstate') | 220 | ).local_abspath('dirstate') |
321 | 221 | self._dirstate = dirstate.DirState.on_file(local_path, | 221 | self._dirstate = dirstate.DirState.on_file(local_path, |
323 | 222 | self._sha1_provider()) | 222 | self._sha1_provider(), self._worth_saving_limit()) |
324 | 223 | return self._dirstate | 223 | return self._dirstate |
325 | 224 | 224 | ||
326 | 225 | def _sha1_provider(self): | 225 | def _sha1_provider(self): |
327 | @@ -234,6 +234,15 @@ | |||
328 | 234 | else: | 234 | else: |
329 | 235 | return None | 235 | return None |
330 | 236 | 236 | ||
331 | 237 | def _worth_saving_limit(self): | ||
332 | 238 | """How many hash changes are ok before we must save the dirstate. | ||
333 | 239 | |||
334 | 240 | :return: an integer. -1 means never save. | ||
335 | 241 | """ | ||
336 | 242 | # XXX: In the future, we could return -1 for logical read-only | ||
337 | 243 | # operations like status. For now, just use a small number. | ||
338 | 244 | return 10 | ||
339 | 245 | |||
340 | 237 | def filter_unversioned_files(self, paths): | 246 | def filter_unversioned_files(self, paths): |
341 | 238 | """Filter out paths that are versioned. | 247 | """Filter out paths that are versioned. |
342 | 239 | 248 | ||
343 | @@ -839,7 +848,7 @@ | |||
344 | 839 | rollback_rename() | 848 | rollback_rename() |
345 | 840 | raise | 849 | raise |
346 | 841 | result.append((from_rel, to_rel)) | 850 | result.append((from_rel, to_rel)) |
348 | 842 | state._dirblock_state = dirstate.DirState.IN_MEMORY_MODIFIED | 851 | state._mark_modified() |
349 | 843 | self._make_dirty(reset_inventory=False) | 852 | self._make_dirty(reset_inventory=False) |
350 | 844 | 853 | ||
351 | 845 | return result | 854 | return result |
Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.
This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
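The general idea behind this can be sketched in a few lines of plain Python. This is an illustrative sketch only, not bzrlib code; `SerialisedEntryCache` and its methods are invented names, and real dirstate entries are far more complex than the tuples used here:

```python
# Illustrative sketch: remember the serialised form of every entry and
# re-serialise only the entries known to have changed, so that saving
# costs O(changes) rather than O(size-of-tree).

class SerialisedEntryCache(object):

    def __init__(self, serialise):
        self._serialise = serialise  # function: entry -> serialised line
        self._lines = {}             # entry key -> cached serialised line
        self._dirty = set()          # keys whose cached line is stale

    def mark_changed(self, key):
        """Record that an entry's serialised form is out of date."""
        self._dirty.add(key)

    def get_lines(self, entries):
        """Return serialised lines, reusing cached lines where possible."""
        result = []
        for key, entry in entries:
            if key in self._dirty or key not in self._lines:
                self._lines[key] = self._serialise(entry)
            result.append(self._lines[key])
        self._dirty.clear()
        return result

cache = SerialisedEntryCache(lambda entry: '\0'.join(entry))
entries = [(('', 'a'), ('a', 'file')), (('', 'b'), ('b', 'file'))]
first = cache.get_lines(entries)   # serialises everything once
cache.mark_changed(('', 'b'))      # later, only 'b' is re-serialised
second = cache.get_lines(entries)
```

The `_known_hash_changes` set in the patch plays the role of the dirty set here: status only updates cached hashes, so only those entries ever need re-serialising.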
Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.
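For concreteness, the overlay idea might look roughly like the following. This is a hypothetical sketch, not part of this patch: `append_overlay` and `load_with_overlays` are invented names, and real dirstate records are much richer than the simplified key/value lines used here:

```python
# Hypothetical sketch of the 'overlay' idea: adds and content changes
# append records to the dirstate file, while deletes and renames would
# still force a full clean rewrite. Records are simplified to
# key\0value lines for illustration.
import os
import tempfile

def append_overlay(path, changed_lines):
    """Append changed entry lines instead of rewriting the whole file."""
    with open(path, 'ab') as f:
        for line in changed_lines:
            f.write(b'overlay: ' + line + b'\n')

def load_with_overlays(path):
    """Read base entries, letting later overlay records win."""
    entries = {}
    with open(path, 'rb') as f:
        for line in f:
            line = line.rstrip(b'\n')
            if line.startswith(b'overlay: '):
                line = line[len(b'overlay: '):]
            key = line.split(b'\x00', 1)[0]
            entries[key] = line  # a later record overrides an earlier one
    return entries

# Tiny demo: a base file with two entries, then one overlayed change.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wb') as f:
    f.write(b'a\x00v1\nb\x00v1\n')
append_overlay(demo_path, [b'a\x00v2'])
entries = load_with_overlays(demo_path)
os.remove(demo_path)
```

The append is cheap and crash-safe in the common case (a torn append leaves the base records intact), at the cost of a compaction pass whenever the overlay tail grows too long or a delete/rename occurs.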
More broadly, I think the important thing is to begin recording the changes, as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.
Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.
Thoughts?