Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 268 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Reviewer: bzr-core (review pending)
Review via email: mp+6841@code.launchpad.net

This proposal has been superseded by a proposal from 2009-05-28.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
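The idea reduces to a small sketch (hypothetical names, not bzrlib's real API; the actual patch tracks changes in a set and walks two parallel sorted lists rather than using a dict):

```python
def serialise_entry(entry):
    # Stand-in for DirState._entry_to_line: NUL-joined key and details.
    key, details = entry
    return '\0'.join(list(key) + [str(d) for d in details])

class LineCachingState:
    """Toy dirstate that remembers each entry's serialised line."""

    def __init__(self, entries):
        self._entries = entries        # sorted list of (key, details) pairs
        self._known_changes = set()    # keys whose cached line is stale
        self._line_cache = {}          # key -> serialised line

    def mark_modified(self, entry):
        # Analogue of the patch's _mark_modified(entries=[...]).
        self._known_changes.add(entry[0])

    def get_lines(self):
        lines = []
        for entry in self._entries:
            key = entry[0]
            if key in self._known_changes or key not in self._line_cache:
                # Only changed (or never-seen) entries pay serialisation cost.
                self._line_cache[key] = serialise_entry(entry)
            lines.append(self._line_cache[key])
        self._known_changes.clear()
        return lines
```

Save time then scales with the number of changed entries rather than with tree size, which is the point of the patch.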

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.
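The overlay option could look roughly like this (a hedged sketch under an invented flat key/value format, not the real dirstate layout; `serialise`, `save`, and `load` are made-up names):

```python
import os

def serialise(key, value):
    # Invented flat record format: key and value separated by a NUL byte.
    return key + "\0" + value + "\n"

def save(path, all_entries, changed_keys, structural_change):
    """Write the state file.

    Deletes and renames (structural_change=True) force a full clean
    rewrite; plain adds and content changes just append overlay records.
    """
    if structural_change or not os.path.exists(path):
        # Full rewrite: O(size of tree).
        with open(path, "w") as f:
            for key in sorted(all_entries):
                f.write(serialise(key, all_entries[key]))
    else:
        # Common case: append only the changed entries.
        with open(path, "a") as f:
            for key in sorted(changed_keys):
                f.write(serialise(key, all_entries[key]))

def load(path):
    # Later overlay records win over earlier base records for the same key.
    entries = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("\0")
            entries[key] = value
    return entries
```

The write path becomes O(number of changes) for the common case, at the cost of a last-wins merge on load and an occasional compaction.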

More broadly, I think the important thing is to begin recording the changes as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

Revision history for this message
Andrew Bennetts (spiv) wrote :

Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)

Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.

We just need to agree on a number of changes below which it's not worth saving. I suggest we save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.
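The proposed cutoff is trivial to express; a sketch with the threshold as a parameter, since the exact value (3 in the suggestion above) was still open for debate:

```python
WORTH_SAVING_LIMIT = 3  # value suggested in the thread; tunable

def worth_saving(num_changed_entries, limit=WORTH_SAVING_LIMIT):
    """Skip the dirstate rewrite when fewer than `limit` entries changed."""
    return num_changed_entries >= limit
```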

Revision history for this message
John A Meinel (jameinel) wrote :

Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a number of changes below which it's not worth saving. I suggest we save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.

A guideline is: "Don't save if it would cost as much data as you avoid
reading".

The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.

If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.

John
=:->
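Both heuristics from this message can be sketched directly (hypothetical helper names; the byte costs are the rough model John describes, ignoring the sha1 CPU cost):

```python
def worth_saving_by_bytes(dirstate_size, changed_file_sizes):
    # Rewriting the dirstate costs roughly `dirstate_size` bytes of I/O;
    # skipping the save costs re-reading each changed file next time.
    # Save only when the re-read cost would be the larger of the two.
    return sum(changed_file_sizes) > dirstate_size

def worth_saving_by_fraction(num_changes, num_entries, fraction=0.01):
    # The simpler alternative: save once at least ~1% of the tree changed.
    return num_changes >= max(1, int(num_entries * fraction))
```

With the numbers in the message (1MB dirstate, 10k files), the byte model says a handful of changed files is not worth a rewrite; the break-even point is around a hundred such files.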

Revision history for this message
Martin Pool (mbp) wrote :

2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> whose sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.

It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logically
read-only operations.

--
Martin <http://launchpad.net/~mbp/>

Preview Diff

=== modified file 'bzrlib/_dirstate_helpers_c.pyx'
--- bzrlib/_dirstate_helpers_c.pyx	2009-03-23 14:59:43 +0000
+++ bzrlib/_dirstate_helpers_c.pyx	2009-05-28 12:35:22 +0000
@@ -909,7 +909,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified([entry])
     return link_or_sha1


=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py	2009-05-06 05:36:28 +0000
+++ bzrlib/dirstate.py	2009-05-28 12:35:22 +0000
@@ -409,11 +409,46 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # If True, use the per-entry field cache for faster serialisation.
+        # If False, disable it. If None, it is not used but may be enabled.
+        self._use_smart_saving = None
+        # The set of known changes
+        self._known_changes = set()
+        # The cache of serialised lines. When built, this is a tuple of
+        # 2 sorted lists that we "walk" while serialising.
+        self._line_cache = None

     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)

+    def _mark_modified(self, entries=None, header_too=False):
+        """Mark this dirstate as modified.
+
+        :param entries: if non-None, mark just these entries as modified.
+        :param header_too: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified entries: %s", entries)
+        if entries:
+            self._known_changes.update([e[0] for e in entries])
+            # We only enable smart saving if it hasn't already been disabled
+            if self._use_smart_saving is not False:
+                self._use_smart_saving = True
+        else:
+            # We don't know exactly what changed so disable smart saving
+            self._use_smart_saving = False
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_too:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._use_smart_saving = None
+        self._known_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.

@@ -545,7 +580,7 @@
         if kind == 'directory':
             # insert a new dirblock
             self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._id_index.setdefault(entry_key[2], set()).add(entry_key)

@@ -1017,8 +1052,7 @@

         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)

     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1460,8 +1494,7 @@
         # Apply in-situ changes.
         self._update_basis_apply_changes(changes)

-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = None
         return

@@ -1594,7 +1627,7 @@
             and stat_value.st_ctime < self._cutoff_time):
             entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
                            packed_stat)
-            self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+            self._mark_modified()

     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1658,14 +1691,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)

     def _get_ghosts_line(self, ghost_ids):
@@ -1676,6 +1708,35 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)

+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        if self._use_smart_saving and self._line_cache:
+            # We unroll this case for better performance ...
+            # The line cache is a tuple of 2 ordered lists: keys and lines.
+            # We keep track of successful matches and only search from there
+            # on next time.
+            entry_to_line = self._entry_to_line
+            known_changes = self._known_changes
+            index = 0
+            keys, serialised = self._line_cache
+            result = []
+            for entry in self._iter_entries():
+                key = entry[0]
+                if key in known_changes:
+                    result.append(entry_to_line(entry))
+                else:
+                    if keys[index] != key:
+                        try:
+                            index = keys.index(key, index + 1)
+                        except ValueError:
+                            result.append(entry_to_line(entry))
+                            continue
+                    result.append(serialised[index])
+                    index += 1
+            return result
+        else:
+            return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into a entry record.

@@ -2057,6 +2118,39 @@
         self._read_header_if_needed()
         if self._dirblock_state == DirState.NOT_IN_MEMORY:
             _read_dirblocks(self)
+            # While it's a small overhead, it's good to build the line cache
+            # now while we know that the dirstate is loaded and unmodified.
+            # If we leave it till later, it takes a while longer because the
+            # memory representation and file representation are no longer
+            # in sync.
+            self._build_line_cache()
+
+    def _build_line_cache(self):
+        """Build the line cache.
+
+        The line cache maps entry keys to serialised lines via
+        a tuple of 2 sorted lists.
+        """
+        self._state_file.seek(0)
+        lines = self._state_file.readlines()
+        # There are 5 header lines: 3 in the prelude, a line for
+        # parents and a line for ghosts. There is also a trailing
+        # empty line. We skip over those.
+        # Each line starts with a null and ends with a null and
+        # newline. We don't keep those because the serialisation
+        # process adds them.
+        values = [l[1:-2] for l in lines[5:-1]]
+        if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
+            keys = []
+            for directory in self._dirblocks:
+                keys.extend([e[0] for e in directory[1]])
+        else:
+            # Be safe and calculate the keys from the lines
+            keys = []
+            for v in values:
+                fields = v.split('\0', 3)
+                keys.append((fields[0], fields[1], fields[2]))
+        self._line_cache = (keys, values)

     def _read_header(self):
         """This reads in the metadata header, and the parent ids.
@@ -2162,8 +2256,7 @@
             self._state_file.writelines(self.get_lines())
             self._state_file.truncate()
             self._state_file.flush()
-            self._header_state = DirState.IN_MEMORY_UNMODIFIED
-            self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+            self._mark_unmodified()
         finally:
             if grabbed_write_lock:
                 self._lock_token = self._lock_token.restore_read_lock()
@@ -2185,8 +2278,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2212,7 +2304,7 @@
             self._make_absent(entry)
         self.update_minimal(('', '', new_id), 'd',
             path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index is not None:
             self._id_index.setdefault(new_id, set()).add(entry[0])

@@ -2352,8 +2444,7 @@
         self._entries_to_current_state(new_entries)
         self._parents = [rev_id for rev_id, tree in trees]
         self._ghosts = list(ghosts)
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = id_index

     def _sort_entries(self, entry_list):
@@ -2471,7 +2562,7 @@
                 # without seeing it in the new list. so it must be gone.
                 self._make_absent(current_old)
                 current_old = advance(old_iterator)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         self._id_index = None
         self._packed_stat_index = None

@@ -2524,7 +2615,7 @@
             if update_tree_details[0][0] == 'a': # absent
                 raise AssertionError('bad row %r' % (update_tree_details,))
             update_tree_details[0] = DirState.NULL_PARENT_DETAILS
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         return last_reference

     def update_minimal(self, key, minikind, executable=False, fingerprint='',
@@ -2650,7 +2741,7 @@
             if not present:
                 self._dirblocks.insert(block_index, (subdir_key[0], []))

-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()

     def _validate(self):
         """Check that invariants on the dirblock are correct.
@@ -2936,7 +3027,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-        state._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        state._mark_modified([entry])
        return link_or_sha1
    update_entry = py_update_entry
269