Merge lp:~jameinel/bzr/2.4-transform-cache-sha-740932 into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Andrew Bennetts
Approved revision: no longer in the source branch.
Merged at revision: 5775
Proposed branch: lp:~jameinel/bzr/2.4-transform-cache-sha-740932
Merge into: lp:bzr
Prerequisite: lp:~jameinel/bzr/2.4-transform-create-file-740932
Diff against target: 134 lines (+61/-6)
3 files modified
bzrlib/tests/test_transform.py (+48/-0)
bzrlib/transform.py (+7/-6)
doc/en/release-notes/bzr-2.4.txt (+6/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.4-transform-cache-sha-740932
Reviewer Review Type Date Requested Status
Andrew Bennetts Approve
Review via email: mp+56364@code.launchpad.net

Commit message

Change ``bzrlib.transform.build_tree`` to pass the sha1 of newly created files. This avoids re-hashing content on the next 'bzr st', saving ~30s on a gcc-sized tree.

Description of the change

This is finally something that fixes the 'bzr status' time after 'bzr checkout'.

Specifically, it updates 'build_tree' to record the sha1 of each new file and pass it through to Transform.create_file.
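Conceptually the change is small: the caller already knows each file's sha1 from the inventory entry, so it threads the value through rather than letting the working tree re-hash the bytes later. A minimal sketch of the pattern (`FileSink` is a hypothetical stand-in, not bzrlib's actual TreeTransform):

```python
import hashlib

class FileSink(object):
    """Hypothetical stand-in for TreeTransform.create_file: if the caller
    already knows the content's sha1, record it instead of recomputing."""

    def __init__(self):
        self.observed_sha1s = {}

    def create_file(self, name, contents, sha1=None):
        if sha1 is None:
            # No hint from the caller: hash the bytes ourselves.
            sha1 = hashlib.sha1(contents).hexdigest()
        self.observed_sha1s[name] = sha1

sink = FileSink()
known = hashlib.sha1(b'hello').hexdigest()
sink.create_file('file1', b'hello', sha1=known)  # no re-hash needed
sink.create_file('file2', b'world')              # falls back to hashing
```

The optional `sha1=` keyword mirrors the signature change to `tt.create_file` visible in the diff below; everything else here is illustrative.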

The difference is pretty noticeable:

Without:
1m13s time bzr co --lightweight
  37s time bzr st
   1s time bzr st

With:

1m13s time bzr co --lightweight
   5s time bzr st
   1s time bzr st

Note that some overhead remains. Specifically, we *only* cache the stat info for files; we don't really have a way to do so for directories. So the first 'bzr st' still stats all the dirs and ends up rewriting the dirstate file. However, we *don't* spend 32s re-reading and re-sha1ing all 576MB of gcc's tree. And even then, *some* files would still fall within our 3s safety window, unless we add a time.sleep(3) before we call _apply_observed_sha1s. I'm considering adding that for commands whose TreeTransform phase takes longer than 30s (so the sleep is less than 10% overhead).
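The 3s safety window mentioned above can be sketched as follows. This is an illustrative check with hypothetical helper names, not bzrlib's actual code: an observed sha1 is only safe to record in the stat cache if the file's timestamps are older than a cutoff, because a file modified within the timestamp granularity could change again without the cached hash detectably going stale.

```python
import time

CUTOFF_WINDOW = 3.0  # seconds; matches the ~3s window discussed above

def safe_to_cache(st_mtime, st_ctime, cutoff_time):
    """Illustrative check (hypothetical helper, not bzrlib's exact code):
    only record a sha1 in the stat cache if the file is 'old enough' that
    it cannot be modified again under the same timestamps."""
    return st_mtime < cutoff_time and st_ctime < cutoff_time

now = time.time()
cutoff = now - CUTOFF_WINDOW
# A file last written 10s ago is safely cacheable...
print(safe_to_cache(now - 10, now - 10, cutoff))   # True
# ...one written just now is not: its cached hash could silently go stale.
print(safe_to_cache(now, now, cutoff))             # False
```

The test in the diff below sidesteps this window by pushing `state._cutoff_time` a minute into the future, so freshly created files count as "old enough".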

The one downside is one more O(new_files) dict, storing a (sha1, stat) tuple per entry. On my 64-bit machine, the numbers are:

w/o: VmPeak: 721700 kB
w/: VmPeak: 745276 kB

Given we are already in the 700MB range, I think we can live with another 24MB. I suppose that is one of the peak-memory conditions that I should profile and reduce. :)
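Working backwards from the VmPeak figures gives a rough per-entry cost for the extra dict. The 80,000-file count for a gcc-sized tree is an illustrative assumption, not a measured number:

```python
# VmPeak figures measured above, in kB.
without_kb = 721700
with_kb = 745276
delta_kb = with_kb - without_kb  # 23576 kB, i.e. ~23 MB

# Assuming on the order of 80,000 files in a gcc-sized tree (a guess for
# illustration), the extra dict costs roughly this many bytes per
# (sha1, stat) entry:
n_files = 80000
per_entry_bytes = delta_kb * 1024 // n_files
print('%d kB total, ~%d bytes/entry' % (delta_kb, per_entry_bytes))
# -> 23576 kB total, ~301 bytes/entry
```

~300 bytes per entry is at least plausible for a 64-bit CPython dict slot holding a 40-character hex sha string plus a stat tuple.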

Revision history for this message
Martin Pool (mbp) wrote :

Thanks for tackling this.

I wonder if we can take a further step in this direction still within
the current format by just making sure we never actually hash things
from disk: it's enough to just say that they're either identical to
the current tree, or they're unknown. Recording hashes different to
those in any of the basis trees has very little value.

I don't see why we would be recording or checking the stat values of
directories at all. I don't think it can save us any time.

Martin

Revision history for this message
John A Meinel (jameinel) wrote :

On 04/06/2011 02:40 AM, Martin Pool wrote:
> Thanks for tackling this.
>
> I wonder if we can take a further step in this direction still within
> the current format by just making sure we never actually hash things
> from disk: it's enough to just say that they're either identical to
> the current tree, or they're unknown. Recording hashes different to
> those in any of the basis trees has very little value.

I think first we need to establish that they *are* identical to the
basis. The next step is actually getting the higher levels to understand
an "iter_maybe_changes" style API, so that callers know the values they
are getting *might* have changed, but it isn't guaranteed.

>
> I don't see why we would be recording or checking the stat values of
> directories at all. I don't think it can save us any time.
>
> Martin

I have the feeling the code just doesn't pay attention to type. We need
to call "update_entry", in general, because a directory could have
become a file, etc. I agree that I don't think we need to specifically
be caching the stat information for it.

John
=:->

Revision history for this message
Andrew Bennetts (spiv) wrote :

The change to bzrlib/dirstate.py looks reasonable (it's certainly more readable!) but I'm not sure how it's related to the rest of this patch… hmm, I guess that bit has actually already landed with your 2.4-observed-sha-updates-size patch anyway.

Anyway, the change and new tests look good to me!

review: Approve
Revision history for this message
Robert Collins (lifeless) wrote :

On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
> Thanks for tackling this.
>
> I wonder if we can take a further step in this direction still within
> the current format by just making sure we never actually hash things
> from disk: it's enough to just say that they're either identical to
> the current tree, or they're unknown.  Recording hashes different to
> those in any of the basis trees has very little value.

That's true, but it's not symptomatic here: we were rehashing to detect
differences and storing the result (same as basis). We already take
care in 'bzr st' not to hash things we *know* are different (e.g. a
different file length).

And as John says, we can't tell something is a dir still unless we stat it ;)

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

On 7 April 2011 14:36, Robert Collins <email address hidden> wrote:
> On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
> And as John says, we can't tell something is a directory still unless we stat it ;)

My point is that merely discovering something is still a directory, or
is a directory with a new mtime/size/ctime, should not cause us to
rewrite the dirstate file, and according to John's earlier comment it
does. For a directory, we cannot do anything useful with the stat
cache and so there is no point keeping it up to date.

Revision history for this message
Robert Collins (lifeless) wrote :

On Thu, Apr 7, 2011 at 4:49 PM, Martin Pool <email address hidden> wrote:
> On 7 April 2011 14:36, Robert Collins <email address hidden> wrote:
>> On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
>> And as John says, we can't tell something is a directory still unless we stat it ;)
>
> My point is that merely discovering something is still a directory, or
> is a directory with a new mtime/size/ctime, should not cause us to
> rewrite the dirstate file, and according to John's earlier comment it
> does.  For a directory, we cannot do anything useful with the stat
> cache and so there is no point keeping it up to date.

Ah! I misunderstood - I was focused on 'cost of hashing', not 'cost of
writing a new dirstate file'.

There's a separate flag for dirstate file dirtiness - a threshold below
which a new file isn't written for stat cache changes. Probably stat
changes in directories should not count towards that threshold.
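The threshold idea Robert describes might be sketched like so. Names and the threshold value are hypothetical for illustration; bzrlib's actual bookkeeping lives inside DirState and differs:

```python
# Illustrative sketch: stat-cache-only updates are tallied, and the
# dirstate file is only rewritten once enough of them accumulate.
# Directory stat changes are excluded from the tally, since a cached
# stat for a directory is never useful.
SAVE_THRESHOLD = 10  # assumed value, for illustration only

class StatCacheTally(object):
    def __init__(self):
        self.worth_saving_changes = 0

    def note_stat_change(self, kind):
        if kind == 'directory':
            return  # a changed dir mtime never pays for a dirstate rewrite
        self.worth_saving_changes += 1

    def should_write_dirstate(self):
        return self.worth_saving_changes >= SAVE_THRESHOLD

tally = StatCacheTally()
for _ in range(20):
    tally.note_stat_change('directory')
print(tally.should_write_dirstate())  # False: dir-only churn is ignored
for _ in range(10):
    tally.note_stat_change('file')
print(tally.should_write_dirstate())  # True: enough file changes to save
```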

Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Preview Diff

=== modified file 'bzrlib/tests/test_transform.py'
--- bzrlib/tests/test_transform.py 2011-04-05 13:25:50 +0000
+++ bzrlib/tests/test_transform.py 2011-04-08 13:34:19 +0000
@@ -1970,6 +1970,18 @@
         self.addCleanup(target.unlock)
         self.assertEqual([], list(target.iter_changes(revision_tree)))
 
+    def test_build_tree_accelerator_tree_observes_sha1(self):
+        source = self.create_ab_tree()
+        sha1 = osutils.sha_string('A')
+        target = self.make_branch_and_tree('target')
+        target.lock_write()
+        self.addCleanup(target.unlock)
+        state = target.current_dirstate()
+        state._cutoff_time = time.time() + 60
+        build_tree(source.basis_tree(), target, source)
+        entry = state._get_entry(0, path_utf8='file1')
+        self.assertEqual(sha1, entry[1][0][1])
+
     def test_build_tree_accelerator_tree_missing_file(self):
         source = self.create_ab_tree()
         os.unlink('source/file1')
@@ -2133,6 +2145,42 @@
         self.assertEqual('file.moved', target.id2path('lower-id'))
         self.assertEqual('FILE', target.id2path('upper-id'))
 
+    def test_build_tree_observes_sha(self):
+        source = self.make_branch_and_tree('source')
+        self.build_tree(['source/file1', 'source/dir/', 'source/dir/file2'])
+        source.add(['file1', 'dir', 'dir/file2'],
+                   ['file1-id', 'dir-id', 'file2-id'])
+        source.commit('new files')
+        target = self.make_branch_and_tree('target')
+        target.lock_write()
+        self.addCleanup(target.unlock)
+        # We make use of the fact that DirState caches its cutoff time. So we
+        # set the 'safe' time to one minute in the future.
+        state = target.current_dirstate()
+        state._cutoff_time = time.time() + 60
+        build_tree(source.basis_tree(), target)
+        entry1_sha = osutils.sha_file_by_name('source/file1')
+        entry2_sha = osutils.sha_file_by_name('source/dir/file2')
+        # entry[1] is the state information, entry[1][0] is the state of the
+        # working tree, entry[1][0][1] is the sha value for the current working
+        # tree
+        entry1 = state._get_entry(0, path_utf8='file1')
+        self.assertEqual(entry1_sha, entry1[1][0][1])
+        # The 'size' field must also be set.
+        self.assertEqual(25, entry1[1][0][2])
+        entry1_state = entry1[1][0]
+        entry2 = state._get_entry(0, path_utf8='dir/file2')
+        self.assertEqual(entry2_sha, entry2[1][0][1])
+        self.assertEqual(29, entry2[1][0][2])
+        entry2_state = entry2[1][0]
+        # Now, make sure that we don't have to re-read the content. The
+        # packed_stat should match exactly.
+        self.assertEqual(entry1_sha, target.get_file_sha1('file1-id', 'file1'))
+        self.assertEqual(entry2_sha,
+                         target.get_file_sha1('file2-id', 'dir/file2'))
+        self.assertEqual(entry1_state, entry1[1][0])
+        self.assertEqual(entry2_state, entry2[1][0])
+
 
 class TestCommitTransform(tests.TestCaseWithTransport):
 
=== modified file 'bzrlib/transform.py'
--- bzrlib/transform.py 2011-04-07 10:36:24 +0000
+++ bzrlib/transform.py 2011-04-08 13:34:19 +0000
@@ -2541,7 +2541,7 @@
                 executable = tree.is_executable(file_id, tree_path)
                 if executable:
                     tt.set_executability(executable, trans_id)
-                trans_data = (trans_id, tree_path)
+                trans_data = (trans_id, tree_path, entry.text_sha1)
                 deferred_contents.append((file_id, trans_data))
             else:
                 file_trans_id[file_id] = new_by_entry(tt, entry, parent_id,
@@ -2592,10 +2592,11 @@
     unchanged = dict(unchanged)
     new_desired_files = []
     count = 0
-    for file_id, (trans_id, tree_path) in desired_files:
+    for file_id, (trans_id, tree_path, text_sha1) in desired_files:
         accelerator_path = unchanged.get(file_id)
         if accelerator_path is None:
-            new_desired_files.append((file_id, (trans_id, tree_path)))
+            new_desired_files.append((file_id,
+                (trans_id, tree_path, text_sha1)))
             continue
         pb.update('Adding file contents', count + offset, total)
         if hardlink:
@@ -2608,7 +2609,7 @@
             contents = filtered_output_bytes(contents, filters,
                 ContentFilterContext(tree_path, tree))
             try:
-                tt.create_file(contents, trans_id)
+                tt.create_file(contents, trans_id, sha1=text_sha1)
             finally:
                 try:
                     contents.close()
@@ -2617,13 +2618,13 @@
                     pass
             count += 1
     offset += count
-    for count, ((trans_id, tree_path), contents) in enumerate(
+    for count, ((trans_id, tree_path, text_sha1), contents) in enumerate(
             tree.iter_files_bytes(new_desired_files)):
         if wt.supports_content_filtering():
             filters = wt._content_filter_stack(tree_path)
             contents = filtered_output_bytes(contents, filters,
                 ContentFilterContext(tree_path, tree))
-        tt.create_file(contents, trans_id)
+        tt.create_file(contents, trans_id, sha1=text_sha1)
         pb.update('Adding file contents', count + offset, total)
 
 
=== modified file 'doc/en/release-notes/bzr-2.4.txt'
--- doc/en/release-notes/bzr-2.4.txt 2011-04-08 10:54:03 +0000
+++ doc/en/release-notes/bzr-2.4.txt 2011-04-08 13:34:19 +0000
@@ -26,6 +26,12 @@
 .. Improvements to existing commands, especially improved performance
    or memory usage, or better results.
 
+* When building a new WorkingTree (such as during ``bzr co`` or
+  ``bzr branch``) we now properly store the stat and hash of files that
+  are old enough. This saves a fair amount of time on the first
+  ``bzr status`` (on a 500MB tree, it saves about 30+s).
+  (John Arbash Meinel, #740932)
+
 * Resolve ``lp:FOO`` urls locally rather than doing an XMLRPC request if
   the user has done ``bzr launchpad-login``. The bzr+ssh URLs were already
   being handed off to the remote server anyway (xmlrpc has been mapping