Merge lp:~jameinel/bzr/2.4-transform-cache-sha-740932 into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Andrew Bennetts
Approved revision: no longer in the source branch.
Merged at revision: 5775
Proposed branch: lp:~jameinel/bzr/2.4-transform-cache-sha-740932
Merge into: lp:bzr
Prerequisite: lp:~jameinel/bzr/2.4-transform-create-file-740932
Diff against target: 134 lines (+61/-6)
3 files modified
bzrlib/tests/test_transform.py (+48/-0)
bzrlib/transform.py (+7/-6)
doc/en/release-notes/bzr-2.4.txt (+6/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.4-transform-cache-sha-740932
Reviewer: Andrew Bennetts
Review status: Approve
Review via email: mp+56364@code.launchpad.net

Commit message

Change ``bzrlib.transform.build_tree`` to pass the sha1 of newly created files. This avoids re-hashing the content on the next 'bzr st', saving ~30s on a gcc-sized tree.

Description of the change

This finally fixes the slow 'bzr status' after a fresh 'bzr checkout'.

Specifically, it updates 'build_tree' to carry the sha1 it already knows from the source tree and pass it to Transform.create_file.
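
Condensed to a toy model (everything here, including `ToyTransform`, is invented for illustration; the real change is in the preview diff at the bottom of this page), the idea is simply: when the caller already knows a file's sha1 from the source tree, hand it to create_file() so nothing has to be re-read and re-hashed later.

    import hashlib


    class ToyTransform(object):
        """Invented stand-in for a TreeTransform-like object."""

        def __init__(self):
            # trans_id -> sha1 observations that can seed the stat/hash cache later
            self._observed_sha1s = {}

        def create_file(self, contents, trans_id, sha1=None):
            data = b''.join(contents)
            if sha1 is not None:
                # The caller (e.g. build_tree) already knows the content hash from
                # the source tree; remember it so a later status-like pass can be
                # seeded from it instead of re-reading and re-hashing the file.
                self._observed_sha1s[trans_id] = sha1
            # ... write `data` to disk as the transform normally would ...
            return len(data)


    tt = ToyTransform()
    text = b'contents of A\n'
    tt.create_file([text], 'new-file-1', sha1=hashlib.sha1(text).hexdigest())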

The difference is pretty noticeable:

Without:
1m13s time bzr co --lightweight
  37s time bzr st
   1s time bzr st

With:
1m13s time bzr co --lightweight
   5s time bzr st
   1s time bzr st

Note that we still have some overhead. Specifically, we *only* cache the stat info for files; we don't really have a way to do so for directories. So the first 'bzr st' still stats all the dirs and ends up rewriting the Dirstate file. However, we *don't* spend 32s re-reading and re-sha1ing all 576MB of gcc's tree. And even if we did, there would still be *some* files that fall within our 3s window, unless we add a time.sleep(3) before we call _apply_observed_sha1s. I'm considering adding that for commands that spend longer than 30s in the TreeTransform phase (so it is less than 10% overhead).
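
For context, the '3s window' refers to the usual stat-cache safety rule: a hash may only be trusted if the file's mtime is comfortably older than 'now', otherwise the file could still change without its timestamp visibly moving. Below is a rough sketch of that check (illustrative only, not bzrlib's implementation; dirstate does this via a cached cutoff time, which is why the tests in the diff push state._cutoff_time a minute into the future so every new file counts as 'old enough').

    import os
    import time


    def safe_to_cache_sha1(path, window=3.0):
        """Illustrative only: is a freshly observed sha1 for `path` safe to cache?

        If the file was modified within the last few seconds it could be modified
        again without a visible mtime change (at coarse timestamp granularity), so
        its hash must not be trusted by the next 'bzr st'.
        """
        st = os.lstat(path)
        return st.st_mtime < time.time() - window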

The one downside is that it adds one more O(new_files) dict, which stores a tuple of (sha1, stat value) per file. On my 64-bit machine, the numbers are:

w/o: VmPeak: 721700 kB
w/: VmPeak: 745276 kB

Given we are already in the 700MB range, I think we can live with another 24MB. I suppose that is one of the peak-memory conditions that I should profile and reduce. :)

Martin Pool (mbp) wrote:

Thanks for tackling this.

I wonder if we can take a further step in this direction still within
the current format by just making sure we never actually hash things
from disk: it's enough to just say that they're either identical to
the current tree, or they're unknown. Recording hashes different to
those in any of the basis trees has very little value.

I don't see why we would be recording or checking the stat values of
directories at all. I don't think it can save us any time.

Martin

John A Meinel (jameinel) wrote:

On 04/06/2011 02:40 AM, Martin Pool wrote:
> Thanks for tackling this.
>
> I wonder if we can take a further step in this direction still within
> the current format by just making sure we never actually hash things
> from disk: it's enough to just say that they're either identical to
> the current tree, or they're unknown. Recording hashes different to
> those in any of the basis trees has very little value.

I think first we need to establish that they *are* identical to the
basis. The next step is getting the higher levels to understand
"iter_maybe_changes", so that callers know the values they are
getting *might* have changed, but it isn't guaranteed.
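
(There is no `iter_maybe_changes` in bzrlib today; purely as a hypothetical sketch of the API shape being described here, with every helper name invented:)

    def iter_maybe_changes(tree, basis):
        """Hypothetical API: yield (path, certainty) for every path in `tree`.

        certainty is 'unchanged', 'changed', or 'unknown'.  'unknown' means the
        cheap checks could not decide; the caller may hash the file, or simply
        treat it as possibly changed.
        """
        for path in tree.iter_all_paths():            # invented helper
            if tree.stat_cache_matches(path, basis):  # invented helper
                yield path, 'unchanged'
            elif tree.size_differs(path, basis):      # invented helper
                yield path, 'changed'
            else:
                yield path, 'unknown'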

>
> I don't see why we would be recording or checking the stat values of
> directories at all. I don't think it can save us any time.
>
> Martin

I have the feeling the code just doesn't pay attention to the type. We
need to call "update_entry" in general, because a directory could have
become a file, etc. I agree that we don't need to specifically cache
the stat information for directories.

John
=:->

Andrew Bennetts (spiv) wrote:

The change to bzrlib/dirstate.py looks reasonable (it's certainly more readable!) but I'm not sure how it's related to the rest of this patch… hmm, I guess that bit has actually already landed with your 2.4-observed-sha-updates-size patch anyway.

Anyway, the change and new tests look good to me!

review: Approve
Robert Collins (lifeless) wrote:

On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
> Thanks for tackling this.
>
> I wonder if we can take a further step in this direction still within
> the current format by just making sure we never actually hash things
> from disk: it's enough to just say that they're either identical to
> the current tree, or they're unknown.  Recording hashes different to
> those in any of the basis trees has very little value.

That's true, but it's not the symptom here: we were rehashing to detect
differences and storing the result (same as basis). We already take
care in 'bzr st' not to hash things we *know* are different (e.g. a
different file length).

And as John says, we can't tell something is still a dir unless we stat it ;)

-Rob
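
(To illustrate the shortcut Robert describes, with invented names: a file whose size differs from the cached size is known to be changed without ever hashing it; hashing is only needed when the cheap checks are inconclusive.)

    def must_hash_to_decide(cached_size, cached_sha1, st):
        """Illustrative only: does 'status' need to hash this file to classify it?

        st is an os.stat_result for the file as it currently is on disk.
        """
        if cached_sha1 is None:
            return True     # nothing cached yet; hashing is the only way to know
        if st.st_size != cached_size:
            return False    # size differs: known changed, no hash needed
        return True         # same size: hash (unless the stat cache matches exactly)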

Martin Pool (mbp) wrote:

On 7 April 2011 14:36, Robert Collins <email address hidden> wrote:
> On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
> And as John says, we can't tell something is a directory still unless we stat it ;)

My point is that merely discovering something is still a directory, or
is a directory with a new mtime/size/ctime, should not cause us to
rewrite the dirstate file, and according to John's earlier comment it
does. For a directory, we cannot do anything useful with the stat
cache and so there is no point keeping it up to date.

Robert Collins (lifeless) wrote:

On Thu, Apr 7, 2011 at 4:49 PM, Martin Pool <email address hidden> wrote:
> On 7 April 2011 14:36, Robert Collins <email address hidden> wrote:
>> On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
>> And as John says, we can't tell something is a directory still unless we stat it ;)
>
> My point is that merely discovering something is still a directory, or
> is a directory with a new mtime/size/ctime, should not cause us to
> rewrite the dirstate file, and according to John's earlier comment it
> does.  For a directory, we cannot do anything useful with the stat
> cache and so there is no point keeping it up to date.

Ah! I misunderstood - I was focused on the 'cost of hashing', not the
'cost of writing a new dirstate file'.

There's a separate flag for dirstate file dirtiness - a threshold below
which a new file isn't written for stat cache changes. Probably stat
changes in directories should not count towards that threshold.
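
(A rough model of that threshold idea; the names here are illustrative and bzrlib's real dirstate bookkeeping differs in detail:)

    class StatCacheDirtiness(object):
        """Illustrative only: decide whether stat-cache updates justify a rewrite.

        Hash/stat-cache-only updates are tallied, and the dirstate file is only
        rewritten once enough of them accumulate.  Directory stat changes carry
        no reusable cache information, so they would simply not be counted.
        """

        def __init__(self, worth_saving_limit=10):
            self._worth_saving_limit = worth_saving_limit
            self._cache_updates = 0

        def note_stat_cache_update(self, kind):
            if kind == 'directory':
                return  # nothing cacheable is gained from a directory stat
            self._cache_updates += 1

        def worth_saving(self):
            return self._cache_updates >= self._worth_saving_limit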

John A Meinel (jameinel) wrote:

sent to pqm by email

Preview Diff

=== modified file 'bzrlib/tests/test_transform.py'
--- bzrlib/tests/test_transform.py 2011-04-05 13:25:50 +0000
+++ bzrlib/tests/test_transform.py 2011-04-08 13:34:19 +0000
@@ -1970,6 +1970,18 @@
 self.addCleanup(target.unlock)
 self.assertEqual([], list(target.iter_changes(revision_tree)))

+ def test_build_tree_accelerator_tree_observes_sha1(self):
+ source = self.create_ab_tree()
+ sha1 = osutils.sha_string('A')
+ target = self.make_branch_and_tree('target')
+ target.lock_write()
+ self.addCleanup(target.unlock)
+ state = target.current_dirstate()
+ state._cutoff_time = time.time() + 60
+ build_tree(source.basis_tree(), target, source)
+ entry = state._get_entry(0, path_utf8='file1')
+ self.assertEqual(sha1, entry[1][0][1])
+
 def test_build_tree_accelerator_tree_missing_file(self):
 source = self.create_ab_tree()
 os.unlink('source/file1')
@@ -2133,6 +2145,42 @@
 self.assertEqual('file.moved', target.id2path('lower-id'))
 self.assertEqual('FILE', target.id2path('upper-id'))

+ def test_build_tree_observes_sha(self):
+ source = self.make_branch_and_tree('source')
+ self.build_tree(['source/file1', 'source/dir/', 'source/dir/file2'])
+ source.add(['file1', 'dir', 'dir/file2'],
+ ['file1-id', 'dir-id', 'file2-id'])
+ source.commit('new files')
+ target = self.make_branch_and_tree('target')
+ target.lock_write()
+ self.addCleanup(target.unlock)
+ # We make use of the fact that DirState caches its cutoff time. So we
+ # set the 'safe' time to one minute in the future.
+ state = target.current_dirstate()
+ state._cutoff_time = time.time() + 60
+ build_tree(source.basis_tree(), target)
+ entry1_sha = osutils.sha_file_by_name('source/file1')
+ entry2_sha = osutils.sha_file_by_name('source/dir/file2')
+ # entry[1] is the state information, entry[1][0] is the state of the
+ # working tree, entry[1][0][1] is the sha value for the current working
+ # tree
+ entry1 = state._get_entry(0, path_utf8='file1')
+ self.assertEqual(entry1_sha, entry1[1][0][1])
+ # The 'size' field must also be set.
+ self.assertEqual(25, entry1[1][0][2])
+ entry1_state = entry1[1][0]
+ entry2 = state._get_entry(0, path_utf8='dir/file2')
+ self.assertEqual(entry2_sha, entry2[1][0][1])
+ self.assertEqual(29, entry2[1][0][2])
+ entry2_state = entry2[1][0]
+ # Now, make sure that we don't have to re-read the content. The
+ # packed_stat should match exactly.
+ self.assertEqual(entry1_sha, target.get_file_sha1('file1-id', 'file1'))
+ self.assertEqual(entry2_sha,
+ target.get_file_sha1('file2-id', 'dir/file2'))
+ self.assertEqual(entry1_state, entry1[1][0])
+ self.assertEqual(entry2_state, entry2[1][0])
+

 class TestCommitTransform(tests.TestCaseWithTransport):

=== modified file 'bzrlib/transform.py'
--- bzrlib/transform.py 2011-04-07 10:36:24 +0000
+++ bzrlib/transform.py 2011-04-08 13:34:19 +0000
@@ -2541,7 +2541,7 @@
 executable = tree.is_executable(file_id, tree_path)
 if executable:
 tt.set_executability(executable, trans_id)
- trans_data = (trans_id, tree_path)
+ trans_data = (trans_id, tree_path, entry.text_sha1)
 deferred_contents.append((file_id, trans_data))
 else:
 file_trans_id[file_id] = new_by_entry(tt, entry, parent_id,
@@ -2592,10 +2592,11 @@
 unchanged = dict(unchanged)
 new_desired_files = []
 count = 0
- for file_id, (trans_id, tree_path) in desired_files:
+ for file_id, (trans_id, tree_path, text_sha1) in desired_files:
 accelerator_path = unchanged.get(file_id)
 if accelerator_path is None:
- new_desired_files.append((file_id, (trans_id, tree_path)))
+ new_desired_files.append((file_id,
+ (trans_id, tree_path, text_sha1)))
 continue
 pb.update('Adding file contents', count + offset, total)
 if hardlink:
@@ -2608,7 +2609,7 @@
 contents = filtered_output_bytes(contents, filters,
 ContentFilterContext(tree_path, tree))
 try:
- tt.create_file(contents, trans_id)
+ tt.create_file(contents, trans_id, sha1=text_sha1)
 finally:
 try:
 contents.close()
@@ -2617,13 +2618,13 @@
 pass
 count += 1
 offset += count
- for count, ((trans_id, tree_path), contents) in enumerate(
+ for count, ((trans_id, tree_path, text_sha1), contents) in enumerate(
 tree.iter_files_bytes(new_desired_files)):
 if wt.supports_content_filtering():
 filters = wt._content_filter_stack(tree_path)
 contents = filtered_output_bytes(contents, filters,
 ContentFilterContext(tree_path, tree))
- tt.create_file(contents, trans_id)
+ tt.create_file(contents, trans_id, sha1=text_sha1)
 pb.update('Adding file contents', count + offset, total)

=== modified file 'doc/en/release-notes/bzr-2.4.txt'
--- doc/en/release-notes/bzr-2.4.txt 2011-04-08 10:54:03 +0000
+++ doc/en/release-notes/bzr-2.4.txt 2011-04-08 13:34:19 +0000
@@ -26,6 +26,12 @@
 .. Improvements to existing commands, especially improved performance
 or memory usage, or better results.

+* When building a new WorkingTree (such as during ``bzr co`` or
+ ``bzr branch``) we now properly store the stat and hash of files that
+ are old enough. This saves a fair amount of time on the first
+ ``bzr status`` (on a 500MB tree, it saves about 30+s).
+ (John Arbash Meinel, #740932)
+
 * Resolve ``lp:FOO`` urls locally rather than doing an XMLRPC request if
 the user has done ``bzr launchpad-login``. The bzr+ssh URLs were already
 being handed off to the remote server anyway (xmlrpc has been mapping