Merge lp:~jameinel/bzr/2.4-transform-cache-sha-740932 into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Andrew Bennetts
Approved revision: no longer in the source branch.
Merged at revision: 5775
Proposed branch: lp:~jameinel/bzr/2.4-transform-cache-sha-740932
Merge into: lp:bzr
Prerequisite: lp:~jameinel/bzr/2.4-transform-create-file-740932
Diff against target: 134 lines (+61/-6)
3 files modified
bzrlib/tests/test_transform.py (+48/-0)
bzrlib/transform.py (+7/-6)
doc/en/release-notes/bzr-2.4.txt (+6/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.4-transform-cache-sha-740932
Reviewer: Andrew Bennetts
Review status: Approve
Review via email: mp+56364@code.launchpad.net

Commit message

Change ``bzrlib.transform.build_tree`` to pass the sha1 of newly created files. This avoids re-hashing the content on the next 'bzr st', saving ~30s on a gcc-sized tree.

Description of the change

This finally fixes the slow 'bzr status' after a fresh 'bzr checkout'.

Specifically, it updates 'build_tree' to carry the sha1 it already knows from the source tree and pass it to Transform.create_file.
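
Condensed to a toy model (everything here, including `ToyTransform`, is invented for illustration; the real change is in the preview diff at the bottom of this page), the idea is simply: when the caller already knows a file's sha1 from the source tree, hand it to create_file() so nothing has to be re-read and re-hashed later.

    import hashlib


    class ToyTransform(object):
        """Invented stand-in for a TreeTransform-like object."""

        def __init__(self):
            # trans_id -> sha1 observations that can seed the stat/hash cache later
            self._observed_sha1s = {}

        def create_file(self, contents, trans_id, sha1=None):
            data = b''.join(contents)
            if sha1 is not None:
                # The caller (e.g. build_tree) already knows the content hash from
                # the source tree; remember it so a later status-like pass can be
                # seeded from it instead of re-reading and re-hashing the file.
                self._observed_sha1s[trans_id] = sha1
            # ... write `data` to disk as the transform normally would ...
            return len(data)


    tt = ToyTransform()
    text = b'contents of A\n'
    tt.create_file([text], 'new-file-1', sha1=hashlib.sha1(text).hexdigest())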

The difference is pretty noticeable:

Without:
1m13s time bzr co --lightweight
  37s time bzr st
   1s time bzr st

With:
1m13s time bzr co --lightweight
   5s time bzr st
   1s time bzr st

Note that we still have some overhead. Specifically, we *only* cache the stat info for files; we don't really have a way to do so for directories. So the first 'bzr st' still stats all the dirs and ends up rewriting the Dirstate file. However, we *don't* spend 32s re-reading and re-sha1ing all 576MB of gcc's tree. And even if we did, there would still be *some* files that fall within our 3s window, unless we add a time.sleep(3) before we call _apply_observed_sha1s. I'm considering adding that for commands that spend longer than 30s in the TreeTransform phase (so it is less than 10% overhead).
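
For context, the '3s window' refers to the usual stat-cache safety rule: a hash may only be trusted if the file's mtime is comfortably older than 'now', otherwise the file could still change without its timestamp visibly moving. Below is a rough sketch of that check (illustrative only, not bzrlib's implementation; dirstate does this via a cached cutoff time, which is why the tests in the diff push state._cutoff_time a minute into the future so every new file counts as 'old enough').

    import os
    import time


    def safe_to_cache_sha1(path, window=3.0):
        """Illustrative only: is a freshly observed sha1 for `path` safe to cache?

        If the file was modified within the last few seconds it could be modified
        again without a visible mtime change (at coarse timestamp granularity), so
        its hash must not be trusted by the next 'bzr st'.
        """
        st = os.lstat(path)
        return st.st_mtime < time.time() - window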

The one downside is that it adds one more O(new_files) dict, which stores a tuple of (sha1, stat value) per file. On my 64-bit machine, the numbers are:

w/o: VmPeak: 721700 kB
w/: VmPeak: 745276 kB

Given we are already in the 700MB range, I think we can live with another 24MB. I suppose that is one of the peak-memory conditions that I should profile and reduce. :)

Martin Pool (mbp) wrote:

Thanks for tackling this.

I wonder if we can take a further step in this direction still within
the current format by just making sure we never actually hash things
from disk: it's enough to just say that they're either identical to
the current tree, or they're unknown. Recording hashes different to
those in any of the basis trees has very little value.

I don't see why we would be recording or checking the stat values of
directories at all. I don't think it can save us any time.

Martin

John A Meinel (jameinel) wrote:

On 04/06/2011 02:40 AM, Martin Pool wrote:
> Thanks for tackling this.
>
> I wonder if we can take a further step in this direction still within
> the current format by just making sure we never actually hash things
> from disk: it's enough to just say that they're either identical to
> the current tree, or they're unknown. Recording hashes different to
> those in any of the basis trees has very little value.

I think first we need to establish that they *are* identical to the
basis. The next step is getting the higher levels to understand
"iter_maybe_changes", so that callers know the values they are
getting *might* have changed, but it isn't guaranteed.
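
(There is no `iter_maybe_changes` in bzrlib today; purely as a hypothetical sketch of the API shape being described here, with every helper name invented:)

    def iter_maybe_changes(tree, basis):
        """Hypothetical API: yield (path, certainty) for every path in `tree`.

        certainty is 'unchanged', 'changed', or 'unknown'.  'unknown' means the
        cheap checks could not decide; the caller may hash the file, or simply
        treat it as possibly changed.
        """
        for path in tree.iter_all_paths():            # invented helper
            if tree.stat_cache_matches(path, basis):  # invented helper
                yield path, 'unchanged'
            elif tree.size_differs(path, basis):      # invented helper
                yield path, 'changed'
            else:
                yield path, 'unknown'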

>
> I don't see why we would be recording or checking the stat values of
> directories at all. I don't think it can save us any time.
>
> Martin

I have the feeling the code just doesn't pay attention to the type. We
need to call "update_entry" in general, because a directory could have
become a file, etc. I agree that we don't need to specifically cache
the stat information for directories.

John
=:->

Andrew Bennetts (spiv) wrote:

The change to bzrlib/dirstate.py looks reasonable (it's certainly more readable!) but I'm not sure how it's related to the rest of this patch… hmm, I guess that bit has actually already landed with your 2.4-observed-sha-updates-size patch anyway.

Anyway, the change and new tests look good to me!

review: Approve
Robert Collins (lifeless) wrote:

On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
> Thanks for tackling this.
>
> I wonder if we can take a further step in this direction still within
> the current format by just making sure we never actually hash things
> from disk: it's enough to just say that they're either identical to
> the current tree, or they're unknown.  Recording hashes different to
> those in any of the basis trees has very little value.

That's true, but it's not the symptom here: we were rehashing to detect
differences and storing the result (same as basis). We already take
care in 'bzr st' not to hash things we *know* are different (e.g. a
different file length).

And as John says, we can't tell something is still a dir unless we stat it ;)

-Rob
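
(To illustrate the shortcut Robert describes, with invented names: a file whose size differs from the cached size is known to be changed without ever hashing it; hashing is only needed when the cheap checks are inconclusive.)

    def must_hash_to_decide(cached_size, cached_sha1, st):
        """Illustrative only: does 'status' need to hash this file to classify it?

        st is an os.stat_result for the file as it currently is on disk.
        """
        if cached_sha1 is None:
            return True     # nothing cached yet; hashing is the only way to know
        if st.st_size != cached_size:
            return False    # size differs: known changed, no hash needed
        return True         # same size: hash (unless the stat cache matches exactly)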

Martin Pool (mbp) wrote:

On 7 April 2011 14:36, Robert Collins <email address hidden> wrote:
> On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
> And as John says, we can't tell something is a directory still unless we stat it ;)

My point is that merely discovering something is still a directory, or
is a directory with a new mtime/size/ctime, should not cause us to
rewrite the dirstate file, and according to John's earlier comment it
does. For a directory, we cannot do anything useful with the stat
cache and so there is no point keeping it up to date.

Robert Collins (lifeless) wrote:

On Thu, Apr 7, 2011 at 4:49 PM, Martin Pool <email address hidden> wrote:
> On 7 April 2011 14:36, Robert Collins <email address hidden> wrote:
>> On Wed, Apr 6, 2011 at 12:42 PM, Martin Pool <email address hidden> wrote:
>> And as John says, we can't tell something is a directory still unless we stat it ;)
>
> My point is that merely discovering something is still a directory, or
> is a directory with a new mtime/size/ctime, should not cause us to
> rewrite the dirstate file, and according to John's earlier comment it
> does.  For a directory, we cannot do anything useful with the stat
> cache and so there is no point keeping it up to date.

Ah! I misunderstood - I was focused on the 'cost of hashing', not the
'cost of writing a new dirstate file'.

There's a separate flag for dirstate file dirtiness - a threshold below
which a new file isn't written for stat cache changes. Probably stat
changes in directories should not count towards that threshold.
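
(A rough model of that threshold idea; the names here are illustrative and bzrlib's real dirstate bookkeeping differs in detail:)

    class StatCacheDirtiness(object):
        """Illustrative only: decide whether stat-cache updates justify a rewrite.

        Hash/stat-cache-only updates are tallied, and the dirstate file is only
        rewritten once enough of them accumulate.  Directory stat changes carry
        no reusable cache information, so they would simply not be counted.
        """

        def __init__(self, worth_saving_limit=10):
            self._worth_saving_limit = worth_saving_limit
            self._cache_updates = 0

        def note_stat_cache_update(self, kind):
            if kind == 'directory':
                return  # nothing cacheable is gained from a directory stat
            self._cache_updates += 1

        def worth_saving(self):
            return self._cache_updates >= self._worth_saving_limit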

John A Meinel (jameinel) wrote:

sent to pqm by email

Preview Diff

=== modified file 'bzrlib/tests/test_transform.py'
--- bzrlib/tests/test_transform.py 2011-04-05 13:25:50 +0000
+++ bzrlib/tests/test_transform.py 2011-04-08 13:34:19 +0000
@@ -1970,6 +1970,18 @@
 self.addCleanup(target.unlock)
 self.assertEqual([], list(target.iter_changes(revision_tree)))

+ def test_build_tree_accelerator_tree_observes_sha1(self):
+ source = self.create_ab_tree()
+ sha1 = osutils.sha_string('A')
+ target = self.make_branch_and_tree('target')
+ target.lock_write()
+ self.addCleanup(target.unlock)
+ state = target.current_dirstate()
+ state._cutoff_time = time.time() + 60
+ build_tree(source.basis_tree(), target, source)
+ entry = state._get_entry(0, path_utf8='file1')
+ self.assertEqual(sha1, entry[1][0][1])
+
 def test_build_tree_accelerator_tree_missing_file(self):
 source = self.create_ab_tree()
 os.unlink('source/file1')
@@ -2133,6 +2145,42 @@
 self.assertEqual('file.moved', target.id2path('lower-id'))
 self.assertEqual('FILE', target.id2path('upper-id'))

+ def test_build_tree_observes_sha(self):
+ source = self.make_branch_and_tree('source')
+ self.build_tree(['source/file1', 'source/dir/', 'source/dir/file2'])
+ source.add(['file1', 'dir', 'dir/file2'],
+ ['file1-id', 'dir-id', 'file2-id'])
+ source.commit('new files')
+ target = self.make_branch_and_tree('target')
+ target.lock_write()
+ self.addCleanup(target.unlock)
+ # We make use of the fact that DirState caches its cutoff time. So we
+ # set the 'safe' time to one minute in the future.
+ state = target.current_dirstate()
+ state._cutoff_time = time.time() + 60
+ build_tree(source.basis_tree(), target)
+ entry1_sha = osutils.sha_file_by_name('source/file1')
+ entry2_sha = osutils.sha_file_by_name('source/dir/file2')
+ # entry[1] is the state information, entry[1][0] is the state of the
+ # working tree, entry[1][0][1] is the sha value for the current working
+ # tree
+ entry1 = state._get_entry(0, path_utf8='file1')
+ self.assertEqual(entry1_sha, entry1[1][0][1])
+ # The 'size' field must also be set.
+ self.assertEqual(25, entry1[1][0][2])
+ entry1_state = entry1[1][0]
+ entry2 = state._get_entry(0, path_utf8='dir/file2')
+ self.assertEqual(entry2_sha, entry2[1][0][1])
+ self.assertEqual(29, entry2[1][0][2])
+ entry2_state = entry2[1][0]
+ # Now, make sure that we don't have to re-read the content. The
+ # packed_stat should match exactly.
+ self.assertEqual(entry1_sha, target.get_file_sha1('file1-id', 'file1'))
+ self.assertEqual(entry2_sha,
+ target.get_file_sha1('file2-id', 'dir/file2'))
+ self.assertEqual(entry1_state, entry1[1][0])
+ self.assertEqual(entry2_state, entry2[1][0])
+

 class TestCommitTransform(tests.TestCaseWithTransport):

=== modified file 'bzrlib/transform.py'
--- bzrlib/transform.py 2011-04-07 10:36:24 +0000
+++ bzrlib/transform.py 2011-04-08 13:34:19 +0000
@@ -2541,7 +2541,7 @@
 executable = tree.is_executable(file_id, tree_path)
 if executable:
 tt.set_executability(executable, trans_id)
- trans_data = (trans_id, tree_path)
+ trans_data = (trans_id, tree_path, entry.text_sha1)
 deferred_contents.append((file_id, trans_data))
 else:
 file_trans_id[file_id] = new_by_entry(tt, entry, parent_id,
@@ -2592,10 +2592,11 @@
 unchanged = dict(unchanged)
 new_desired_files = []
 count = 0
- for file_id, (trans_id, tree_path) in desired_files:
+ for file_id, (trans_id, tree_path, text_sha1) in desired_files:
 accelerator_path = unchanged.get(file_id)
 if accelerator_path is None:
- new_desired_files.append((file_id, (trans_id, tree_path)))
+ new_desired_files.append((file_id,
+ (trans_id, tree_path, text_sha1)))
 continue
 pb.update('Adding file contents', count + offset, total)
 if hardlink:
@@ -2608,7 +2609,7 @@
 contents = filtered_output_bytes(contents, filters,
 ContentFilterContext(tree_path, tree))
 try:
- tt.create_file(contents, trans_id)
+ tt.create_file(contents, trans_id, sha1=text_sha1)
 finally:
 try:
 contents.close()
@@ -2617,13 +2618,13 @@
 pass
 count += 1
 offset += count
- for count, ((trans_id, tree_path), contents) in enumerate(
+ for count, ((trans_id, tree_path, text_sha1), contents) in enumerate(
 tree.iter_files_bytes(new_desired_files)):
 if wt.supports_content_filtering():
 filters = wt._content_filter_stack(tree_path)
 contents = filtered_output_bytes(contents, filters,
 ContentFilterContext(tree_path, tree))
- tt.create_file(contents, trans_id)
+ tt.create_file(contents, trans_id, sha1=text_sha1)
 pb.update('Adding file contents', count + offset, total)

=== modified file 'doc/en/release-notes/bzr-2.4.txt'
--- doc/en/release-notes/bzr-2.4.txt 2011-04-08 10:54:03 +0000
+++ doc/en/release-notes/bzr-2.4.txt 2011-04-08 13:34:19 +0000
@@ -26,6 +26,12 @@
 .. Improvements to existing commands, especially improved performance
 or memory usage, or better results.

+* When building a new WorkingTree (such as during ``bzr co`` or
+ ``bzr branch``) we now properly store the stat and hash of files that
+ are old enough. This saves a fair amount of time on the first
+ ``bzr status`` (on a 500MB tree, it saves about 30+s).
+ (John Arbash Meinel, #740932)
+
 * Resolve ``lp:FOO`` urls locally rather than doing an XMLRPC request if
 the user has done ``bzr launchpad-login``. The bzr+ssh URLs were already
 being handed off to the remote server anyway (xmlrpc has been mapping