Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy on 2009-05-28
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 268 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Reviewer Review Type Date Requested Status
bzr-core 2009-05-28 Pending
Review via email: mp+6841@code.launchpad.net

This proposal has been superseded by a proposal from 2009-05-28.

Ian Clatworthy (ian-clatworthy) wrote :

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'status file' takes 0.4 seconds in hg and 1.0 seconds in bzr, and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
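The core idea — remember the serialised form of each entry and only re-serialise the entries known to have changed — can be sketched as follows. (This is a minimal illustration with hypothetical names; the real patch works inside DirState with a tuple of two sorted lists rather than a dict.)

```python
def serialise_entries(entries, cached_lines, known_changes, entry_to_line):
    """Serialise dirstate entries, reusing cached text for unchanged ones.

    entries: iterable of (key, details) tuples in on-disk order.
    cached_lines: dict mapping entry key -> previously serialised line.
    known_changes: set of entry keys modified since the last save.
    entry_to_line: function serialising one entry (the expensive step).
    """
    lines = []
    for entry in entries:
        key = entry[0]
        if key in known_changes or key not in cached_lines:
            # Slow path: this entry changed (or is new), re-serialise it.
            lines.append(entry_to_line(entry))
        else:
            # Fast path: reuse the text produced by the previous save.
            lines.append(cached_lines[key])
    return lines
```

With few changes, almost every entry takes the fast path, so serialisation cost becomes proportional to the number of changes rather than the size of the tree.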

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.

More broadly, I think the important thing is to begin recording the changes as this patch allows. So my current thinking is that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

Andrew Bennetts (spiv) wrote :

Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)

Ian Clatworthy (ian-clatworthy) wrote :

> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)

Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 seconds.

We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.
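The proposed check is simple enough to sketch directly. (Hypothetical helper name; per the later commit "support a configurable limit of changes before dirstate saving", the real patch wires a configurable threshold into DirState's save path.)

```python
def worth_saving(num_changes, min_changes=3):
    """Decide whether enough entries changed to justify rewriting the
    dirstate file.

    Skips the write when fewer than min_changes entries changed, on the
    theory that re-reading a couple of files later is cheaper than
    rewriting a multi-megabyte dirstate now.
    """
    return num_changes >= min_changes
```

With the suggested default of 3, a status that touches one or two files never pays the serialisation cost at all.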

John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.

A guideline is: "Don't save if it would cost as much data as you avoid
reading".

The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.

If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoepA4ACgkQJdeBCYSNAANrVwCgwjbMhA60vNrtlpbciGdFw9pp
Y6MAn1FfgUIcKOSm5BjgXP5EjKJtbaVS
=i1ML
-----END PGP SIGNATURE-----

Martin Pool (mbp) wrote :

2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> who's sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.

It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logically
read-only operations.

--
Martin <http://launchpad.net/~mbp/>

4384. By Ian Clatworthy on 2009-05-28

support a configurable limit of changes before dirstate saving

4385. By Ian Clatworthy on 2009-06-17

merge bzr.dev r4449

4386. By Ian Clatworthy on 2009-06-17

apply review feedback

Unmerged revisions

4386. By Ian Clatworthy on 2009-06-17

apply review feedback

4385. By Ian Clatworthy on 2009-06-17

merge bzr.dev r4449

4384. By Ian Clatworthy on 2009-05-28

support a configurable limit of changes before dirstate saving

4383. By Ian Clatworthy on 2009-05-28

unroll smart saving logic into one loop in one routine

4382. By Ian Clatworthy on 2009-05-28

use a tuple of lists instead of a dict for the line cache

4381. By Ian Clatworthy on 2009-05-28

speed up dirstate serialisation, making status faster

4380. By Ian Clatworthy on 2009-05-27

prepare for smart dirstate saving

Preview Diff

=== modified file 'bzrlib/_dirstate_helpers_c.pyx'
--- bzrlib/_dirstate_helpers_c.pyx 2009-03-23 14:59:43 +0000
+++ bzrlib/_dirstate_helpers_c.pyx 2009-05-28 12:35:22 +0000
@@ -909,7 +909,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-    self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+    self._mark_modified([entry])
     return link_or_sha1
 
 

=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py 2009-05-06 05:36:28 +0000
+++ bzrlib/dirstate.py 2009-05-28 12:35:22 +0000
@@ -409,11 +409,46 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # If True, use the per-entry field cache for faster serialisation.
+        # If False, disable it. If None, it is not used but may be enabled.
+        self._use_smart_saving = None
+        # The set of known changes
+        self._known_changes = set()
+        # The cache of serialised lines. When built, this is a tuple of
+        # 2 sorted lists that we "walk" while serialising.
+        self._line_cache = None
 
     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)
 
+    def _mark_modified(self, entries=None, header_too=False):
+        """Mark this dirstate as modified.
+
+        :param entries: if non-None, mark just these entries as modified.
+        :param header_too: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified entries: %s", entries)
+        if entries:
+            self._known_changes.update([e[0] for e in entries])
+            # We only enable smart saving if it hasn't already been disabled
+            if self._use_smart_saving is not False:
+                self._use_smart_saving = True
+        else:
+            # We don't know exactly what changed so disable smart saving
+            self._use_smart_saving = False
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_too:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._use_smart_saving = None
+        self._known_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.
 
@@ -545,7 +580,7 @@
         if kind == 'directory':
             # insert a new dirblock
             self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._id_index.setdefault(entry_key[2], set()).add(entry_key)
 
@@ -1017,8 +1052,7 @@
 
         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
 
     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1460,8 +1494,7 @@
         # Apply in-situ changes.
         self._update_basis_apply_changes(changes)
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = None
         return
 
@@ -1594,7 +1627,7 @@
             and stat_value.st_ctime < self._cutoff_time):
             entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
                            packed_stat)
-            self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+            self._mark_modified()
 
     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1658,14 +1691,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)
 
     def _get_ghosts_line(self, ghost_ids):
@@ -1676,6 +1708,35 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)
 
+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        if self._use_smart_saving and self._line_cache:
+            # We unroll this case for better performance ...
+            # The line cache is a tuple of 2 ordered lists: keys and lines.
+            # We keep track of successful matches and only search from there
+            # on next time.
+            entry_to_line = self._entry_to_line
+            known_changes = self._known_changes
+            index = 0
+            keys, serialised = self._line_cache
+            result = []
+            for entry in self._iter_entries():
+                key = entry[0]
+                if key in known_changes:
+                    result.append(entry_to_line(entry))
+                else:
+                    if keys[index] != key:
+                        try:
+                            index = keys.index(key, index + 1)
+                        except ValueError:
+                            result.append(entry_to_line(entry))
+                            continue
+                    result.append(serialised[index])
+                    index += 1
+            return result
+        else:
+            return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into a entry record.
 
@@ -2057,6 +2118,39 @@
         self._read_header_if_needed()
         if self._dirblock_state == DirState.NOT_IN_MEMORY:
             _read_dirblocks(self)
+            # While it's a small overhead, it's good to build the line cache
+            # now while we know that the dirstate is loaded and unmodified.
+            # If we leave it till later, it takes a while longer because the
+            # memory representation and file representation are no longer
+            # in sync.
+            self._build_line_cache()
+
+    def _build_line_cache(self):
+        """Build the line cache.
+
+        The line cache maps entry keys to serialised lines via
+        a tuple of 2 sorted lists.
+        """
+        self._state_file.seek(0)
+        lines = self._state_file.readlines()
+        # There are 5 header lines: 3 in the prelude, a line for
+        # parents and a line for ghosts. There is also a trailing
+        # empty line. We skip over those.
+        # Each line starts with a null and ends with a null and
+        # newline. We don't keep those because the serialisation
+        # process adds them.
+        values = [l[1:-2] for l in lines[5:-1]]
+        if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
+            keys = []
+            for directory in self._dirblocks:
+                keys.extend([e[0] for e in directory[1]])
+        else:
+            # Be safe and calculate the keys from the lines
+            keys = []
+            for v in values:
+                fields = v.split('\0', 3)
+                keys.append((fields[0], fields[1], fields[2]))
+        self._line_cache = (keys, values)
 
     def _read_header(self):
         """This reads in the metadata header, and the parent ids.
@@ -2162,8 +2256,7 @@
             self._state_file.writelines(self.get_lines())
             self._state_file.truncate()
             self._state_file.flush()
-            self._header_state = DirState.IN_MEMORY_UNMODIFIED
-            self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+            self._mark_unmodified()
         finally:
             if grabbed_write_lock:
                 self._lock_token = self._lock_token.restore_read_lock()
@@ -2185,8 +2278,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2212,7 +2304,7 @@
             self._make_absent(entry)
         self.update_minimal(('', '', new_id), 'd',
             path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index is not None:
             self._id_index.setdefault(new_id, set()).add(entry[0])
 
@@ -2352,8 +2444,7 @@
         self._entries_to_current_state(new_entries)
         self._parents = [rev_id for rev_id, tree in trees]
         self._ghosts = list(ghosts)
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = id_index
 
     def _sort_entries(self, entry_list):
@@ -2471,7 +2562,7 @@
                 # without seeing it in the new list. so it must be gone.
                 self._make_absent(current_old)
             current_old = advance(old_iterator)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         self._id_index = None
         self._packed_stat_index = None
 
@@ -2524,7 +2615,7 @@
             if update_tree_details[0][0] == 'a': # absent
                 raise AssertionError('bad row %r' % (update_tree_details,))
             update_tree_details[0] = DirState.NULL_PARENT_DETAILS
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         return last_reference
 
     def update_minimal(self, key, minikind, executable=False, fingerprint='',
@@ -2650,7 +2741,7 @@
         if not present:
             self._dirblocks.insert(block_index, (subdir_key[0], []))
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
 
     def _validate(self):
         """Check that invariants on the dirblock are correct.
@@ -2936,7 +3027,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-    state._dirblock_state = DirState.IN_MEMORY_MODIFIED
+    state._mark_modified([entry])
     return link_or_sha1
 update_entry = py_update_entry
 