Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 268 lines
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Related bugs:

Reviewer | Review Type | Date Requested | Status
---|---|---|---
bzr-core | | | Pending

Review via email: mp+6841@code.launchpad.net
This proposal has been superseded by a proposal from 2009-05-28.
Commit message
Description of the change
Ian Clatworthy (ian-clatworthy) wrote:
Andrew Bennetts (spiv) wrote:
Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)
Ian Clatworthy (ian-clatworthy) wrote:
> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)
Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.
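As a hedged sketch of the threshold idea (the helper name and the threshold value here are illustrative, not part of the patch):

```python
# Hypothetical helper: skip rewriting the dirstate when only a couple
# of entries changed, since re-hashing a handful of files on the next
# run is cheaper than rewriting the whole state file now.
WORTH_SAVING_LIMIT = 3  # save only when this many (or more) entries changed

def worth_saving(num_changed_entries):
    """Return True if the in-memory dirstate is worth writing to disk."""
    return num_changed_entries >= WORTH_SAVING_LIMIT
```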
John A Meinel (jameinel) wrote:
Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or fewer files are changed. If someone has a better number, let me know.
A guideline is: "Don't save if it would cost as much data as you avoid
reading".
The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.
If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.
John
=:->
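One literal, byte-for-byte reading of John's guideline could be sketched as follows (purely illustrative; the thread never settles on exact numbers, and the 10-file example above implies weighting the re-read cost more heavily than a plain byte comparison):

```python
def worth_saving_by_bytes(dirstate_size, changed_file_sizes):
    # Save the dirstate only if the bytes we would otherwise have to
    # re-read next time (the changed files) outweigh the bytes written
    # by rewriting the whole dirstate now.
    return sum(changed_file_sizes) >= dirstate_size
```

Under this reading, a 1MB dirstate with ten 10k changed files (100k to re-read) would not be worth rewriting; John's cutoff of 10 files, or a "< 1% of the tree" rule, is a stricter heuristic than pure byte parity.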
Martin Pool (mbp) wrote:
2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> whose sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.
It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logically
read-only operations.
--
Martin <http://
Preview Diff
1 | === modified file 'bzrlib/_dirstate_helpers_c.pyx' | |||
2 | --- bzrlib/_dirstate_helpers_c.pyx 2009-03-23 14:59:43 +0000 | |||
3 | +++ bzrlib/_dirstate_helpers_c.pyx 2009-05-28 12:35:22 +0000 | |||
4 | @@ -909,7 +909,7 @@ | |||
5 | 909 | else: | 909 | else: |
6 | 910 | entry[1][0] = ('l', '', stat_value.st_size, | 910 | entry[1][0] = ('l', '', stat_value.st_size, |
7 | 911 | False, DirState.NULLSTAT) | 911 | False, DirState.NULLSTAT) |
9 | 912 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 912 | self._mark_modified([entry]) |
10 | 913 | return link_or_sha1 | 913 | return link_or_sha1 |
11 | 914 | 914 | ||
12 | 915 | 915 | ||
13 | 916 | 916 | ||
14 | === modified file 'bzrlib/dirstate.py' | |||
15 | --- bzrlib/dirstate.py 2009-05-06 05:36:28 +0000 | |||
16 | +++ bzrlib/dirstate.py 2009-05-28 12:35:22 +0000 | |||
17 | @@ -409,11 +409,46 @@ | |||
18 | 409 | # during commit. | 409 | # during commit. |
19 | 410 | self._last_block_index = None | 410 | self._last_block_index = None |
20 | 411 | self._last_entry_index = None | 411 | self._last_entry_index = None |
21 | 412 | # If True, use the per-entry field cache for faster serialisation. | ||
22 | 413 | # If False, disable it. If None, it is not used but may be enabled. | ||
23 | 414 | self._use_smart_saving = None | ||
24 | 415 | # The set of known changes | ||
25 | 416 | self._known_changes = set() | ||
26 | 417 | # The cache of serialised lines. When built, this is a tuple of | ||
27 | 418 | # 2 sorted lists that we "walk" while serialising. | ||
28 | 419 | self._line_cache = None | ||
29 | 412 | 420 | ||
30 | 413 | def __repr__(self): | 421 | def __repr__(self): |
31 | 414 | return "%s(%r)" % \ | 422 | return "%s(%r)" % \ |
32 | 415 | (self.__class__.__name__, self._filename) | 423 | (self.__class__.__name__, self._filename) |
33 | 416 | 424 | ||
34 | 425 | def _mark_modified(self, entries=None, header_too=False): | ||
35 | 426 | """Mark this dirstate as modified. | ||
36 | 427 | |||
37 | 428 | :param entries: if non-None, mark just these entries as modified. | ||
38 | 429 | :param header_too: mark the header modified as well, not just the | ||
39 | 430 | dirblocks. | ||
40 | 431 | """ | ||
41 | 432 | #trace.mutter_callsite(3, "modified entries: %s", entries) | ||
42 | 433 | if entries: | ||
43 | 434 | self._known_changes.update([e[0] for e in entries]) | ||
44 | 435 | # We only enable smart saving if it hasn't already been disabled | ||
45 | 436 | if self._use_smart_saving is not False: | ||
46 | 437 | self._use_smart_saving = True | ||
47 | 438 | else: | ||
48 | 439 | # We don't know exactly what changed so disable smart saving | ||
49 | 440 | self._use_smart_saving = False | ||
50 | 441 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | ||
51 | 442 | if header_too: | ||
52 | 443 | self._header_state = DirState.IN_MEMORY_MODIFIED | ||
53 | 444 | |||
54 | 445 | def _mark_unmodified(self): | ||
55 | 446 | """Mark this dirstate as unmodified.""" | ||
56 | 447 | self._header_state = DirState.IN_MEMORY_UNMODIFIED | ||
57 | 448 | self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED | ||
58 | 449 | self._use_smart_saving = None | ||
59 | 450 | self._known_changes = set() | ||
60 | 451 | |||
61 | 417 | def add(self, path, file_id, kind, stat, fingerprint): | 452 | def add(self, path, file_id, kind, stat, fingerprint): |
62 | 418 | """Add a path to be tracked. | 453 | """Add a path to be tracked. |
63 | 419 | 454 | ||
64 | @@ -545,7 +580,7 @@ | |||
65 | 545 | if kind == 'directory': | 580 | if kind == 'directory': |
66 | 546 | # insert a new dirblock | 581 | # insert a new dirblock |
67 | 547 | self._ensure_block(block_index, entry_index, utf8path) | 582 | self._ensure_block(block_index, entry_index, utf8path) |
69 | 548 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 583 | self._mark_modified() |
70 | 549 | if self._id_index: | 584 | if self._id_index: |
71 | 550 | self._id_index.setdefault(entry_key[2], set()).add(entry_key) | 585 | self._id_index.setdefault(entry_key[2], set()).add(entry_key) |
72 | 551 | 586 | ||
73 | @@ -1017,8 +1052,7 @@ | |||
74 | 1017 | 1052 | ||
75 | 1018 | self._ghosts = [] | 1053 | self._ghosts = [] |
76 | 1019 | self._parents = [parents[0]] | 1054 | self._parents = [parents[0]] |
79 | 1020 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 1055 | self._mark_modified(header_too=True) |
78 | 1021 | self._header_state = DirState.IN_MEMORY_MODIFIED | ||
80 | 1022 | 1056 | ||
81 | 1023 | def _empty_parent_info(self): | 1057 | def _empty_parent_info(self): |
82 | 1024 | return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) - | 1058 | return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) - |
83 | @@ -1460,8 +1494,7 @@ | |||
84 | 1460 | # Apply in-situ changes. | 1494 | # Apply in-situ changes. |
85 | 1461 | self._update_basis_apply_changes(changes) | 1495 | self._update_basis_apply_changes(changes) |
86 | 1462 | 1496 | ||
89 | 1463 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 1497 | self._mark_modified(header_too=True) |
88 | 1464 | self._header_state = DirState.IN_MEMORY_MODIFIED | ||
90 | 1465 | self._id_index = None | 1498 | self._id_index = None |
91 | 1466 | return | 1499 | return |
92 | 1467 | 1500 | ||
93 | @@ -1594,7 +1627,7 @@ | |||
94 | 1594 | and stat_value.st_ctime < self._cutoff_time): | 1627 | and stat_value.st_ctime < self._cutoff_time): |
95 | 1595 | entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3], | 1628 | entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3], |
96 | 1596 | packed_stat) | 1629 | packed_stat) |
98 | 1597 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 1630 | self._mark_modified() |
99 | 1598 | 1631 | ||
100 | 1599 | def _sha_cutoff_time(self): | 1632 | def _sha_cutoff_time(self): |
101 | 1600 | """Return cutoff time. | 1633 | """Return cutoff time. |
102 | @@ -1658,14 +1691,13 @@ | |||
103 | 1658 | """Serialise the entire dirstate to a sequence of lines.""" | 1691 | """Serialise the entire dirstate to a sequence of lines.""" |
104 | 1659 | if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and | 1692 | if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and |
105 | 1660 | self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED): | 1693 | self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED): |
107 | 1661 | # read whats on disk. | 1694 | # read what's on disk. |
108 | 1662 | self._state_file.seek(0) | 1695 | self._state_file.seek(0) |
109 | 1663 | return self._state_file.readlines() | 1696 | return self._state_file.readlines() |
110 | 1664 | lines = [] | 1697 | lines = [] |
111 | 1665 | lines.append(self._get_parents_line(self.get_parent_ids())) | 1698 | lines.append(self._get_parents_line(self.get_parent_ids())) |
112 | 1666 | lines.append(self._get_ghosts_line(self._ghosts)) | 1699 | lines.append(self._get_ghosts_line(self._ghosts)) |
115 | 1667 | # append the root line which is special cased | 1700 | lines.extend(self._get_entry_lines()) |
114 | 1668 | lines.extend(map(self._entry_to_line, self._iter_entries())) | ||
116 | 1669 | return self._get_output_lines(lines) | 1701 | return self._get_output_lines(lines) |
117 | 1670 | 1702 | ||
118 | 1671 | def _get_ghosts_line(self, ghost_ids): | 1703 | def _get_ghosts_line(self, ghost_ids): |
119 | @@ -1676,6 +1708,35 @@ | |||
120 | 1676 | """Create a line for the state file for parents information.""" | 1708 | """Create a line for the state file for parents information.""" |
121 | 1677 | return '\0'.join([str(len(parent_ids))] + parent_ids) | 1709 | return '\0'.join([str(len(parent_ids))] + parent_ids) |
122 | 1678 | 1710 | ||
123 | 1711 | def _get_entry_lines(self): | ||
124 | 1712 | """Create lines for entries.""" | ||
125 | 1713 | if self._use_smart_saving and self._line_cache: | ||
126 | 1714 | # We unroll this case for better performance ... | ||
127 | 1715 | # The line cache is a tuple of 2 ordered lists: keys and lines. | ||
128 | 1716 | # We keep track of successful matches and only search from there | ||
129 | 1717 | # on next time. | ||
130 | 1718 | entry_to_line = self._entry_to_line | ||
131 | 1719 | known_changes = self._known_changes | ||
132 | 1720 | index = 0 | ||
133 | 1721 | keys, serialised = self._line_cache | ||
134 | 1722 | result = [] | ||
135 | 1723 | for entry in self._iter_entries(): | ||
136 | 1724 | key = entry[0] | ||
137 | 1725 | if key in known_changes: | ||
138 | 1726 | result.append(entry_to_line(entry)) | ||
139 | 1727 | else: | ||
140 | 1728 | if keys[index] != key: | ||
141 | 1729 | try: | ||
142 | 1730 | index = keys.index(key, index + 1) | ||
143 | 1731 | except ValueError: | ||
144 | 1732 | result.append(entry_to_line(entry)) | ||
145 | 1733 | continue | ||
146 | 1734 | result.append(serialised[index]) | ||
147 | 1735 | index += 1 | ||
148 | 1736 | return result | ||
149 | 1737 | else: | ||
150 | 1738 | return map(self._entry_to_line, self._iter_entries()) | ||
151 | 1739 | |||
152 | 1679 | def _get_fields_to_entry(self): | 1740 | def _get_fields_to_entry(self): |
153 | 1680 | """Get a function which converts entry fields into a entry record. | 1741 | """Get a function which converts entry fields into a entry record. |
154 | 1681 | 1742 | ||
155 | @@ -2057,6 +2118,39 @@ | |||
156 | 2057 | self._read_header_if_needed() | 2118 | self._read_header_if_needed() |
157 | 2058 | if self._dirblock_state == DirState.NOT_IN_MEMORY: | 2119 | if self._dirblock_state == DirState.NOT_IN_MEMORY: |
158 | 2059 | _read_dirblocks(self) | 2120 | _read_dirblocks(self) |
159 | 2121 | # While it's a small overhead, it's good to build the line cache | ||
160 | 2122 | # now while we know that the dirstate is loaded and unmodified. | ||
161 | 2123 | # If we leave it till later, it takes a while longer because the | ||
162 | 2124 | # memory representation and file representation are no longer | ||
163 | 2125 | # in sync. | ||
164 | 2126 | self._build_line_cache() | ||
165 | 2127 | |||
166 | 2128 | def _build_line_cache(self): | ||
167 | 2129 | """Build the line cache. | ||
168 | 2130 | |||
169 | 2131 | The line cache maps entry keys to serialised lines via | ||
170 | 2132 | a tuple of 2 sorted lists. | ||
171 | 2133 | """ | ||
172 | 2134 | self._state_file.seek(0) | ||
173 | 2135 | lines = self._state_file.readlines() | ||
174 | 2136 | # There are 5 header lines: 3 in the prelude, a line for | ||
175 | 2137 | # parents and a line for ghosts. There is also a trailing | ||
176 | 2138 | # empty line. We skip over those. | ||
177 | 2139 | # Each line starts with a null and ends with a null and | ||
178 | 2140 | # newline. We don't keep those because the serialisation | ||
179 | 2141 | # process adds them. | ||
180 | 2142 | values = [l[1:-2] for l in lines[5:-1]] | ||
181 | 2143 | if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED: | ||
182 | 2144 | keys = [] | ||
183 | 2145 | for directory in self._dirblocks: | ||
184 | 2146 | keys.extend([e[0] for e in directory[1]]) | ||
185 | 2147 | else: | ||
186 | 2148 | # Be safe and calculate the keys from the lines | ||
187 | 2149 | keys = [] | ||
188 | 2150 | for v in values: | ||
189 | 2151 | fields = v.split('\0', 3) | ||
190 | 2152 | keys.append((fields[0], fields[1], fields[2])) | ||
191 | 2153 | self._line_cache = (keys, values) | ||
192 | 2060 | 2154 | ||
193 | 2061 | def _read_header(self): | 2155 | def _read_header(self): |
194 | 2062 | """This reads in the metadata header, and the parent ids. | 2156 | """This reads in the metadata header, and the parent ids. |
195 | @@ -2162,8 +2256,7 @@ | |||
196 | 2162 | self._state_file.writelines(self.get_lines()) | 2256 | self._state_file.writelines(self.get_lines()) |
197 | 2163 | self._state_file.truncate() | 2257 | self._state_file.truncate() |
198 | 2164 | self._state_file.flush() | 2258 | self._state_file.flush() |
201 | 2165 | self._header_state = DirState.IN_MEMORY_UNMODIFIED | 2259 | self._mark_unmodified() |
200 | 2166 | self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED | ||
202 | 2167 | finally: | 2260 | finally: |
203 | 2168 | if grabbed_write_lock: | 2261 | if grabbed_write_lock: |
204 | 2169 | self._lock_token = self._lock_token.restore_read_lock() | 2262 | self._lock_token = self._lock_token.restore_read_lock() |
205 | @@ -2185,8 +2278,7 @@ | |||
206 | 2185 | """ | 2278 | """ |
207 | 2186 | # our memory copy is now authoritative. | 2279 | # our memory copy is now authoritative. |
208 | 2187 | self._dirblocks = dirblocks | 2280 | self._dirblocks = dirblocks |
211 | 2188 | self._header_state = DirState.IN_MEMORY_MODIFIED | 2281 | self._mark_modified(header_too=True) |
210 | 2189 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | ||
212 | 2190 | self._parents = list(parent_ids) | 2282 | self._parents = list(parent_ids) |
213 | 2191 | self._id_index = None | 2283 | self._id_index = None |
214 | 2192 | self._packed_stat_index = None | 2284 | self._packed_stat_index = None |
215 | @@ -2212,7 +2304,7 @@ | |||
216 | 2212 | self._make_absent(entry) | 2304 | self._make_absent(entry) |
217 | 2213 | self.update_minimal(('', '', new_id), 'd', | 2305 | self.update_minimal(('', '', new_id), 'd', |
218 | 2214 | path_utf8='', packed_stat=entry[1][0][4]) | 2306 | path_utf8='', packed_stat=entry[1][0][4]) |
220 | 2215 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 2307 | self._mark_modified() |
221 | 2216 | if self._id_index is not None: | 2308 | if self._id_index is not None: |
222 | 2217 | self._id_index.setdefault(new_id, set()).add(entry[0]) | 2309 | self._id_index.setdefault(new_id, set()).add(entry[0]) |
223 | 2218 | 2310 | ||
224 | @@ -2352,8 +2444,7 @@ | |||
225 | 2352 | self._entries_to_current_state(new_entries) | 2444 | self._entries_to_current_state(new_entries) |
226 | 2353 | self._parents = [rev_id for rev_id, tree in trees] | 2445 | self._parents = [rev_id for rev_id, tree in trees] |
227 | 2354 | self._ghosts = list(ghosts) | 2446 | self._ghosts = list(ghosts) |
230 | 2355 | self._header_state = DirState.IN_MEMORY_MODIFIED | 2447 | self._mark_modified(header_too=True) |
229 | 2356 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | ||
231 | 2357 | self._id_index = id_index | 2448 | self._id_index = id_index |
232 | 2358 | 2449 | ||
233 | 2359 | def _sort_entries(self, entry_list): | 2450 | def _sort_entries(self, entry_list): |
234 | @@ -2471,7 +2562,7 @@ | |||
235 | 2471 | # without seeing it in the new list. so it must be gone. | 2562 | # without seeing it in the new list. so it must be gone. |
236 | 2472 | self._make_absent(current_old) | 2563 | self._make_absent(current_old) |
237 | 2473 | current_old = advance(old_iterator) | 2564 | current_old = advance(old_iterator) |
239 | 2474 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 2565 | self._mark_modified() |
240 | 2475 | self._id_index = None | 2566 | self._id_index = None |
241 | 2476 | self._packed_stat_index = None | 2567 | self._packed_stat_index = None |
242 | 2477 | 2568 | ||
243 | @@ -2524,7 +2615,7 @@ | |||
244 | 2524 | if update_tree_details[0][0] == 'a': # absent | 2615 | if update_tree_details[0][0] == 'a': # absent |
245 | 2525 | raise AssertionError('bad row %r' % (update_tree_details,)) | 2616 | raise AssertionError('bad row %r' % (update_tree_details,)) |
246 | 2526 | update_tree_details[0] = DirState.NULL_PARENT_DETAILS | 2617 | update_tree_details[0] = DirState.NULL_PARENT_DETAILS |
248 | 2527 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 2618 | self._mark_modified() |
249 | 2528 | return last_reference | 2619 | return last_reference |
250 | 2529 | 2620 | ||
251 | 2530 | def update_minimal(self, key, minikind, executable=False, fingerprint='', | 2621 | def update_minimal(self, key, minikind, executable=False, fingerprint='', |
252 | @@ -2650,7 +2741,7 @@ | |||
253 | 2650 | if not present: | 2741 | if not present: |
254 | 2651 | self._dirblocks.insert(block_index, (subdir_key[0], [])) | 2742 | self._dirblocks.insert(block_index, (subdir_key[0], [])) |
255 | 2652 | 2743 | ||
257 | 2653 | self._dirblock_state = DirState.IN_MEMORY_MODIFIED | 2744 | self._mark_modified() |
258 | 2654 | 2745 | ||
259 | 2655 | def _validate(self): | 2746 | def _validate(self): |
260 | 2656 | """Check that invariants on the dirblock are correct. | 2747 | """Check that invariants on the dirblock are correct. |
261 | @@ -2936,7 +3027,7 @@ | |||
262 | 2936 | else: | 3027 | else: |
263 | 2937 | entry[1][0] = ('l', '', stat_value.st_size, | 3028 | entry[1][0] = ('l', '', stat_value.st_size, |
264 | 2938 | False, DirState.NULLSTAT) | 3029 | False, DirState.NULLSTAT) |
266 | 2939 | state._dirblock_state = DirState.IN_MEMORY_MODIFIED | 3030 | state._mark_modified([entry]) |
267 | 2940 | return link_or_sha1 | 3031 | return link_or_sha1 |
268 | 2941 | update_entry = py_update_entry | 3032 | update_entry = py_update_entry |
269 | 2942 | 3033 |
Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.
This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
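The caching scheme can be illustrated with a simplified, self-contained sketch of the walk that `_get_entry_lines` performs in the patch (the function signature and names here are invented for illustration):

```python
def serialise(entries, line_cache, known_changes, entry_to_line):
    """Reuse cached serialised lines for entries not known to be changed.

    entries: iterable of (key, data) pairs in sorted order.
    line_cache: (keys, lines) - two parallel sorted lists built from the
        last read of the state file.
    known_changes: set of keys whose entries were modified in memory.
    entry_to_line: callable that serialises one entry from scratch.
    """
    keys, lines = line_cache
    index = 0
    result = []
    for key, data in entries:
        if key in known_changes:
            # Changed entry: must re-serialise; do not advance the cache.
            result.append(entry_to_line(key, data))
            continue
        if keys[index] != key:
            # Entries were inserted/removed; resync by searching forward.
            try:
                index = keys.index(key, index + 1)
            except ValueError:
                result.append(entry_to_line(key, data))
                continue
        result.append(lines[index])  # cached line is still valid
        index += 1
    return result
```

The point of tracking `index` is that both lists are sorted the same way, so in the common case each lookup is an O(1) comparison rather than a search.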
Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.
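The overlay alternative is not implemented anywhere; a rough, entirely hypothetical sketch of the idea might look like:

```python
class OverlayJournal:
    """Hypothetical append-only overlay for add/change operations.

    Adds and in-place changes are appended as overlay records; a delete
    (or rename) sets a flag forcing a full clean rewrite, since it would
    break the ordering invariants of the base file.
    """

    def __init__(self, base_entries):
        self.base = dict(base_entries)
        self.overlay = []          # appended (key, data) records
        self.needs_full_rewrite = False

    def record_change(self, key, data):
        # Adds and changes are cheap: just append an overlay record.
        self.overlay.append((key, data))

    def record_delete(self, key):
        # Deletes are the expensive case: schedule a full rewrite.
        self.needs_full_rewrite = True
        self.base.pop(key, None)

    def effective_entries(self):
        # Later overlay records win over the base entries.
        merged = dict(self.base)
        merged.update(self.overlay)
        return merged
```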
More broadly, I think the important thing is to begin recording the changes as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.
Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.
Thoughts?