Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 268 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Reviewer: bzr-core (review pending)
Review via email: mp+6841@code.launchpad.net

This proposal has been superseded by a proposal from 2009-05-28.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
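The idea reduces to a small sketch (hypothetical names, not bzrlib's real API; the actual patch tracks changes in a set and walks two parallel sorted lists rather than using a dict):

```python
def serialise_entry(entry):
    # Stand-in for DirState._entry_to_line: NUL-joined key and details.
    key, details = entry
    return '\0'.join(list(key) + [str(d) for d in details])

class LineCachingState:
    """Toy dirstate that remembers each entry's serialised line."""

    def __init__(self, entries):
        self._entries = entries        # sorted list of (key, details) pairs
        self._known_changes = set()    # keys whose cached line is stale
        self._line_cache = {}          # key -> serialised line

    def mark_modified(self, entry):
        # Analogue of the patch's _mark_modified(entries=[...]).
        self._known_changes.add(entry[0])

    def get_lines(self):
        lines = []
        for entry in self._entries:
            key = entry[0]
            if key in self._known_changes or key not in self._line_cache:
                # Only changed (or never-seen) entries pay serialisation cost.
                self._line_cache[key] = serialise_entry(entry)
            lines.append(self._line_cache[key])
        self._known_changes.clear()
        return lines
```

Save time then scales with the number of changed entries rather than with tree size, which is the point of the patch.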

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.
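The overlay option could look roughly like this (a hedged sketch under an invented flat key/value format, not the real dirstate layout; `serialise`, `save`, and `load` are made-up names):

```python
import os

def serialise(key, value):
    # Invented flat record format: key and value separated by a NUL byte.
    return key + "\0" + value + "\n"

def save(path, all_entries, changed_keys, structural_change):
    """Write the state file.

    Deletes and renames (structural_change=True) force a full clean
    rewrite; plain adds and content changes just append overlay records.
    """
    if structural_change or not os.path.exists(path):
        # Full rewrite: O(size of tree).
        with open(path, "w") as f:
            for key in sorted(all_entries):
                f.write(serialise(key, all_entries[key]))
    else:
        # Common case: append only the changed entries.
        with open(path, "a") as f:
            for key in sorted(changed_keys):
                f.write(serialise(key, all_entries[key]))

def load(path):
    # Later overlay records win over earlier base records for the same key.
    entries = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("\0")
            entries[key] = value
    return entries
```

The write path becomes O(number of changes) for the common case, at the cost of a last-wins merge on load and an occasional compaction.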

More broadly, I think the important thing is to begin recording the changes as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

Revision history for this message
Andrew Bennetts (spiv) wrote :

Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)

Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.

We just need to agree on a number of changes below which it's not worth saving. I suggest we save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.
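The proposed cutoff is trivial to express; a sketch with the threshold as a parameter, since the exact value (3 in the suggestion above) was still open for debate:

```python
WORTH_SAVING_LIMIT = 3  # value suggested in the thread; tunable

def worth_saving(num_changed_entries, limit=WORTH_SAVING_LIMIT):
    """Skip the dirstate rewrite when fewer than `limit` entries changed."""
    return num_changed_entries >= limit
```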

Revision history for this message
John A Meinel (jameinel) wrote :

Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a number of changes below which it's not worth saving. I suggest we save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.

A guideline is: "Don't save if it would cost as much data as you avoid
reading".

The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.

If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.

John
=:->
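Both heuristics from this message can be sketched directly (hypothetical helper names; the byte costs are the rough model John describes, ignoring the sha1 CPU cost):

```python
def worth_saving_by_bytes(dirstate_size, changed_file_sizes):
    # Rewriting the dirstate costs roughly `dirstate_size` bytes of I/O;
    # skipping the save costs re-reading each changed file next time.
    # Save only when the re-read cost would be the larger of the two.
    return sum(changed_file_sizes) > dirstate_size

def worth_saving_by_fraction(num_changes, num_entries, fraction=0.01):
    # The simpler alternative: save once at least ~1% of the tree changed.
    return num_changes >= max(1, int(num_entries * fraction))
```

With the numbers in the message (1MB dirstate, 10k files), the byte model says a handful of changed files is not worth a rewrite; the break-even point is around a hundred such files.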

Revision history for this message
Martin Pool (mbp) wrote :

2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> whose sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.

It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logically
read-only operations.

--
Martin <http://launchpad.net/~mbp/>

Preview Diff

=== modified file 'bzrlib/_dirstate_helpers_c.pyx'
--- bzrlib/_dirstate_helpers_c.pyx	2009-03-23 14:59:43 +0000
+++ bzrlib/_dirstate_helpers_c.pyx	2009-05-28 12:35:22 +0000
@@ -909,7 +909,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified([entry])
     return link_or_sha1


=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py	2009-05-06 05:36:28 +0000
+++ bzrlib/dirstate.py	2009-05-28 12:35:22 +0000
@@ -409,11 +409,46 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # If True, use the per-entry field cache for faster serialisation.
+        # If False, disable it. If None, it is not used but may be enabled.
+        self._use_smart_saving = None
+        # The set of known changes
+        self._known_changes = set()
+        # The cache of serialised lines. When built, this is a tuple of
+        # 2 sorted lists that we "walk" while serialising.
+        self._line_cache = None

     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)

+    def _mark_modified(self, entries=None, header_too=False):
+        """Mark this dirstate as modified.
+
+        :param entries: if non-None, mark just these entries as modified.
+        :param header_too: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified entries: %s", entries)
+        if entries:
+            self._known_changes.update([e[0] for e in entries])
+            # We only enable smart saving if it hasn't already been disabled
+            if self._use_smart_saving is not False:
+                self._use_smart_saving = True
+        else:
+            # We don't know exactly what changed so disable smart saving
+            self._use_smart_saving = False
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_too:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._use_smart_saving = None
+        self._known_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.

@@ -545,7 +580,7 @@
         if kind == 'directory':
             # insert a new dirblock
             self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._id_index.setdefault(entry_key[2], set()).add(entry_key)

@@ -1017,8 +1052,7 @@

         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)

     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1460,8 +1494,7 @@
         # Apply in-situ changes.
         self._update_basis_apply_changes(changes)

-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = None
         return

@@ -1594,7 +1627,7 @@
             and stat_value.st_ctime < self._cutoff_time):
             entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
                            packed_stat)
-            self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+            self._mark_modified()

     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1658,14 +1691,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)

     def _get_ghosts_line(self, ghost_ids):
@@ -1676,6 +1708,35 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)

+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        if self._use_smart_saving and self._line_cache:
+            # We unroll this case for better performance ...
+            # The line cache is a tuple of 2 ordered lists: keys and lines.
+            # We keep track of successful matches and only search from there
+            # on next time.
+            entry_to_line = self._entry_to_line
+            known_changes = self._known_changes
+            index = 0
+            keys, serialised = self._line_cache
+            result = []
+            for entry in self._iter_entries():
+                key = entry[0]
+                if key in known_changes:
+                    result.append(entry_to_line(entry))
+                else:
+                    if keys[index] != key:
+                        try:
+                            index = keys.index(key, index + 1)
+                        except ValueError:
+                            result.append(entry_to_line(entry))
+                            continue
+                    result.append(serialised[index])
+                    index += 1
+            return result
+        else:
+            return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into a entry record.

@@ -2057,6 +2118,39 @@
         self._read_header_if_needed()
         if self._dirblock_state == DirState.NOT_IN_MEMORY:
             _read_dirblocks(self)
+            # While it's a small overhead, it's good to build the line cache
+            # now while we know that the dirstate is loaded and unmodified.
+            # If we leave it till later, it takes a while longer because the
+            # memory representation and file representation are no longer
+            # in sync.
+            self._build_line_cache()
+
+    def _build_line_cache(self):
+        """Build the line cache.
+
+        The line cache maps entry keys to serialised lines via
+        a tuple of 2 sorted lists.
+        """
+        self._state_file.seek(0)
+        lines = self._state_file.readlines()
+        # There are 5 header lines: 3 in the prelude, a line for
+        # parents and a line for ghosts. There is also a trailing
+        # empty line. We skip over those.
+        # Each line starts with a null and ends with a null and
+        # newline. We don't keep those because the serialisation
+        # process adds them.
+        values = [l[1:-2] for l in lines[5:-1]]
+        if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
+            keys = []
+            for directory in self._dirblocks:
+                keys.extend([e[0] for e in directory[1]])
+        else:
+            # Be safe and calculate the keys from the lines
+            keys = []
+            for v in values:
+                fields = v.split('\0', 3)
+                keys.append((fields[0], fields[1], fields[2]))
+        self._line_cache = (keys, values)

     def _read_header(self):
         """This reads in the metadata header, and the parent ids.
@@ -2162,8 +2256,7 @@
             self._state_file.writelines(self.get_lines())
             self._state_file.truncate()
             self._state_file.flush()
-            self._header_state = DirState.IN_MEMORY_UNMODIFIED
-            self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+            self._mark_unmodified()
         finally:
             if grabbed_write_lock:
                 self._lock_token = self._lock_token.restore_read_lock()
@@ -2185,8 +2278,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2212,7 +2304,7 @@
             self._make_absent(entry)
         self.update_minimal(('', '', new_id), 'd',
             path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index is not None:
             self._id_index.setdefault(new_id, set()).add(entry[0])

@@ -2352,8 +2444,7 @@
         self._entries_to_current_state(new_entries)
         self._parents = [rev_id for rev_id, tree in trees]
         self._ghosts = list(ghosts)
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = id_index

     def _sort_entries(self, entry_list):
@@ -2471,7 +2562,7 @@
                 # without seeing it in the new list. so it must be gone.
                 self._make_absent(current_old)
                 current_old = advance(old_iterator)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         self._id_index = None
         self._packed_stat_index = None

@@ -2524,7 +2615,7 @@
             if update_tree_details[0][0] == 'a': # absent
                 raise AssertionError('bad row %r' % (update_tree_details,))
             update_tree_details[0] = DirState.NULL_PARENT_DETAILS
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         return last_reference

     def update_minimal(self, key, minikind, executable=False, fingerprint='',
@@ -2650,7 +2741,7 @@
             if not present:
                 self._dirblocks.insert(block_index, (subdir_key[0], []))

-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()

     def _validate(self):
         """Check that invariants on the dirblock are correct.
@@ -2936,7 +3027,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        state._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        state._mark_modified([entry])
        return link_or_sha1
    update_entry = py_update_entry
269