Merge lp:~jameinel/bzr/1.17-chk-multilevel into lp:~bzr/bzr/trunk-old
| Status | Merged |
|---|---|
| Approved by | Ian Clatworthy |
| Approved revision | no longer in the source branch |
| Merged at revision | not available |
| Proposed branch | lp:~jameinel/bzr/1.17-chk-multilevel |
| Merge into | lp:~bzr/bzr/trunk-old |
| Diff against target | 1253 lines |
| To merge this branch | `bzr merge lp:~jameinel/bzr/1.17-chk-multilevel` |
| Related bugs | none |

| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Robert Collins (community) | Approve | | |

Review via email: mp+8157@code.launchpad.net
This proposal supersedes a proposal from 2009-07-01.
Commit message
Description of the change
John A Meinel (jameinel) wrote (posted in a previous version of this proposal):
Robert Collins (lifeless) wrote (posted in a previous version of this proposal):
Shallow immediate review:
The docstring for the class, and perhaps the class name could be better.
In terms of sets, it is:
interesting - uninteresting
So perhaps
CHKMapDifference
as a class name. And for the docstring, something like
"""Iterate the stored pages and key,value pairs for (new - old).
This class provides a generator over the stored CHK pages and the
key,value pairs that are in any of the new maps and not in any of the
old maps.
"""
-Rob
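The (new - old) semantics in the proposed docstring can be illustrated with a small, self-contained sketch. Note this uses plain dicts as stand-ins for CHKMap content; the real class walks serialised pages through a record store, so `chk_map_difference` here is a hypothetical helper, not bzrlib API:

```python
# Minimal illustration of the (new - old) semantics described above.
# Plain dicts stand in for CHKMap key,value content; the real code also
# yields the stored CHK *pages*, which this sketch omits.

def chk_map_difference(new_maps, old_maps):
    """Yield (key, value) pairs present in any new map but in no old map."""
    old_items = set()
    for m in old_maps:
        old_items.update(m.items())
    seen = set()
    for m in new_maps:
        for item in m.items():
            if item not in old_items and item not in seen:
                seen.add(item)
                yield item

old = {('aaa',): 'initial aaa content'}
new = {('aaa',): 'new aaa content', ('bbb',): 'initial bbb content'}
print(sorted(chk_map_difference([new], [old])))
# -> [(('aaa',), 'new aaa content'), (('bbb',), 'initial bbb content')]
```

Both items are yielded because neither (key, value) pair appears in an old map, even though the key `('aaa',)` exists in both.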
John A Meinel (jameinel) wrote:
Just a follow-up from Robert's "immediate" suggestions. This only renames the class and variables. Using "new" and "old" for the sets actually makes it a bit more readable for me: both because it is shorter, and because "interesting" and "uninteresting" are 85% the same word.
Robert Collins (lifeless) wrote:
review +1
Some thoughts
process() could just be an itertools.chain:
return itertools.
There is some code still using [un]interesting that you might want to
change (just grep the diff) to be clearer.
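Robert's truncated `return itertools.` suggestion presumably refers to chaining the two phases of process() into one iterator. A hedged sketch follows; the generator names mirror CHKMapDifference but are stand-ins here, and note that in the merged code `_read_all_roots()` yields bare records (not `(record, [])` pairs), so a small wrapper generator would still be needed before chain could replace the two explicit loops:

```python
import itertools

# Stand-in for CHKMapDifference._read_all_roots, already wrapped so that
# each root record is paired with an empty items list.
def _read_all_roots():
    for record in ['root-1', 'root-2']:
        yield record, []

# Stand-in for CHKMapDifference._process_queues.
def _process_queues():
    yield 'page-1', [(('aaa',), 'new aaa content')]

def process():
    # One chained iterator instead of two sequential for-loops.
    return itertools.chain(_read_all_roots(), _process_queues())

print(list(process()))
```

itertools.chain is lazy, so this preserves the streaming behaviour the original two-loop version was written for.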
Preview Diff
1 | === modified file 'bzrlib/chk_map.py' |
2 | --- bzrlib/chk_map.py 2009-07-01 10:52:23 +0000 |
3 | +++ bzrlib/chk_map.py 2009-07-09 07:36:30 +0000 |
4 | @@ -1396,106 +1396,202 @@ |
5 | return node |
6 | |
7 | |
8 | -def _find_children_info(store, interesting_keys, uninteresting_keys, pb): |
9 | - """Read the associated records, and determine what is interesting.""" |
10 | - uninteresting_keys = set(uninteresting_keys) |
11 | - chks_to_read = uninteresting_keys.union(interesting_keys) |
12 | - next_uninteresting = set() |
13 | - next_interesting = set() |
14 | - next_interesting_intersection = None |
15 | - uninteresting_items = set() |
16 | - interesting_items = set() |
17 | - interesting_to_yield = [] |
18 | - for record in store.get_record_stream(chks_to_read, 'unordered', True): |
19 | - # records_read.add(record.key()) |
20 | - if pb is not None: |
21 | - pb.tick() |
22 | - bytes = record.get_bytes_as('fulltext') |
23 | - # We don't care about search_key_func for this code, because we only |
24 | - # care about external references. |
25 | - node = _deserialise(bytes, record.key, search_key_func=None) |
26 | - if record.key in uninteresting_keys: |
27 | - if type(node) is InternalNode: |
28 | - next_uninteresting.update(node.refs()) |
29 | - else: |
30 | - # We know we are at a LeafNode, so we can pass None for the |
31 | - # store |
32 | - uninteresting_items.update(node.iteritems(None)) |
33 | - else: |
34 | - interesting_to_yield.append(record.key) |
35 | - if type(node) is InternalNode: |
36 | - if next_interesting_intersection is None: |
37 | - next_interesting_intersection = set(node.refs()) |
38 | - else: |
39 | - next_interesting_intersection = \ |
40 | - next_interesting_intersection.intersection(node.refs()) |
41 | - next_interesting.update(node.refs()) |
42 | - else: |
43 | - interesting_items.update(node.iteritems(None)) |
44 | - return (next_uninteresting, uninteresting_items, |
45 | - next_interesting, interesting_to_yield, interesting_items, |
46 | - next_interesting_intersection) |
47 | - |
48 | - |
49 | -def _find_all_uninteresting(store, interesting_root_keys, |
50 | - uninteresting_root_keys, pb): |
51 | - """Determine the full set of uninteresting keys.""" |
52 | - # What about duplicates between interesting_root_keys and |
53 | - # uninteresting_root_keys? |
54 | - if not uninteresting_root_keys: |
55 | - # Shortcut case. We know there is nothing uninteresting to filter out |
56 | - # So we just let the rest of the algorithm do the work |
57 | - # We know there is nothing uninteresting, and we didn't have to read |
58 | - # any interesting records yet. |
59 | - return (set(), set(), set(interesting_root_keys), [], set()) |
60 | - all_uninteresting_chks = set(uninteresting_root_keys) |
61 | - all_uninteresting_items = set() |
62 | - |
63 | - # First step, find the direct children of both the interesting and |
64 | - # uninteresting set |
65 | - (uninteresting_keys, uninteresting_items, |
66 | - interesting_keys, interesting_to_yield, |
67 | - interesting_items, interesting_intersection, |
68 | - ) = _find_children_info(store, interesting_root_keys, |
69 | - uninteresting_root_keys, |
70 | - pb=pb) |
71 | - all_uninteresting_chks.update(uninteresting_keys) |
72 | - all_uninteresting_items.update(uninteresting_items) |
73 | - del uninteresting_items |
74 | - # Do not examine in detail pages common to all interesting trees. |
75 | - # Pages that are common to all interesting trees will have their |
76 | - # older versions found via the uninteresting tree traversal. Some pages |
77 | - # found via the interesting trees traversal will be uninteresting for |
78 | - # other of the interesting trees, which is why we require the pages to be |
79 | - # common for us to trim them. |
80 | - if interesting_intersection is not None: |
81 | - uninteresting_keys.difference_update(interesting_intersection) |
82 | - |
83 | - # Second, find the full set of uninteresting bits reachable by the |
84 | - # uninteresting roots |
85 | - chks_to_read = uninteresting_keys |
86 | - while chks_to_read: |
87 | - next_chks = set() |
88 | - for record in store.get_record_stream(chks_to_read, 'unordered', False): |
89 | - # TODO: Handle 'absent' |
90 | - if pb is not None: |
91 | - pb.tick() |
92 | +class CHKMapDifference(object): |
93 | + """Iterate the stored pages and key,value pairs for (new - old). |
94 | + |
95 | + This class provides a generator over the stored CHK pages and the |
96 | + (key, value) pairs that are in any of the new maps and not in any of the |
97 | + old maps. |
98 | + |
99 | + Note that it may yield chk pages that are common (especially root nodes), |
100 | + but it won't yield (key,value) pairs that are common. |
101 | + """ |
102 | + |
103 | + def __init__(self, store, new_root_keys, old_root_keys, |
104 | + search_key_func, pb=None): |
105 | + self._store = store |
106 | + self._new_root_keys = new_root_keys |
107 | + self._old_root_keys = old_root_keys |
108 | + self._pb = pb |
109 | + # All uninteresting chks that we have seen. By the time they are added |
110 | + # here, they should be either fully ignored, or queued up for |
111 | + # processing |
112 | + self._all_old_chks = set(self._old_root_keys) |
113 | + # All items that we have seen from the old_root_keys |
114 | + self._all_old_items = set() |
115 | + # These are interesting items which were either read, or already in the |
116 | + # interesting queue (so we don't need to walk them again) |
117 | + self._processed_new_refs = set() |
118 | + self._search_key_func = search_key_func |
119 | + |
120 | + # The uninteresting and interesting nodes to be searched |
121 | + self._old_queue = [] |
122 | + self._new_queue = [] |
123 | + # Holds the (key, value) items found when processing the root nodes, |
124 | + # waiting for the uninteresting nodes to be walked |
125 | + self._new_item_queue = [] |
126 | + self._state = None |
127 | + |
128 | + def _read_nodes_from_store(self, keys): |
129 | + # We chose not to use _page_cache, because we think in terms of records |
130 | + # to be yielded. Also, we expect to touch each page only 1 time during |
131 | + # this code. (We may want to evaluate saving the raw bytes into the |
132 | + # page cache, which would allow a working tree update after the fetch |
133 | + # to not have to read the bytes again.) |
134 | + stream = self._store.get_record_stream(keys, 'unordered', True) |
135 | + for record in stream: |
136 | + if self._pb is not None: |
137 | + self._pb.tick() |
138 | + if record.storage_kind == 'absent': |
139 | + raise errors.NoSuchRevision(self._store, record.key) |
140 | bytes = record.get_bytes_as('fulltext') |
141 | - # We don't care about search_key_func for this code, because we |
142 | - # only care about external references. |
143 | - node = _deserialise(bytes, record.key, search_key_func=None) |
144 | + node = _deserialise(bytes, record.key, |
145 | + search_key_func=self._search_key_func) |
146 | if type(node) is InternalNode: |
147 | - # uninteresting_prefix_chks.update(node._items.iteritems()) |
148 | - chks = node._items.values() |
149 | - # TODO: We remove the entries that are already in |
150 | - # uninteresting_chks ? |
151 | - next_chks.update(chks) |
152 | - all_uninteresting_chks.update(chks) |
153 | + # Note we don't have to do node.refs() because we know that |
154 | + # there are no children that have been pushed into this node |
155 | + prefix_refs = node._items.items() |
156 | + items = [] |
157 | else: |
158 | - all_uninteresting_items.update(node._items.iteritems()) |
159 | - chks_to_read = next_chks |
160 | - return (all_uninteresting_chks, all_uninteresting_items, |
161 | - interesting_keys, interesting_to_yield, interesting_items) |
162 | + prefix_refs = [] |
163 | + items = node._items.items() |
164 | + yield record, node, prefix_refs, items |
165 | + |
166 | + def _read_old_roots(self): |
167 | + old_chks_to_enqueue = [] |
168 | + all_old_chks = self._all_old_chks |
169 | + for record, node, prefix_refs, items in \ |
170 | + self._read_nodes_from_store(self._old_root_keys): |
171 | + # Uninteresting node |
172 | + prefix_refs = [p_r for p_r in prefix_refs |
173 | + if p_r[1] not in all_old_chks] |
174 | + new_refs = [p_r[1] for p_r in prefix_refs] |
175 | + all_old_chks.update(new_refs) |
176 | + self._all_old_items.update(items) |
177 | + # Queue up the uninteresting references |
178 | + # Don't actually put them in the 'to-read' queue until we have |
179 | + # finished checking the interesting references |
180 | + old_chks_to_enqueue.extend(prefix_refs) |
181 | + return old_chks_to_enqueue |
182 | + |
183 | + def _enqueue_old(self, new_prefixes, old_chks_to_enqueue): |
184 | + # At this point, we have read all the uninteresting and interesting |
185 | + # items, so we can queue up the uninteresting stuff, knowing that we've |
186 | + # handled the interesting ones |
187 | + for prefix, ref in old_chks_to_enqueue: |
188 | + not_interesting = True |
189 | + for i in xrange(len(prefix), 0, -1): |
190 | + if prefix[:i] in new_prefixes: |
191 | + not_interesting = False |
192 | + break |
193 | + if not_interesting: |
194 | + # This prefix is not part of the remaining 'interesting set' |
195 | + continue |
196 | + self._old_queue.append(ref) |
197 | + |
198 | + def _read_all_roots(self): |
199 | + """Read the root pages. |
200 | + |
201 | + This is structured as a generator, so that the root records can be |
202 | + yielded up to whoever needs them without any buffering. |
203 | + """ |
204 | + # This is the bootstrap phase |
205 | + if not self._old_root_keys: |
206 | + # With no old_root_keys we can just shortcut and be ready |
207 | + # for _flush_new_queue |
208 | + self._new_queue = list(self._new_root_keys) |
209 | + return |
210 | + old_chks_to_enqueue = self._read_old_roots() |
211 | + # filter out any root keys that are already known to be uninteresting |
212 | + new_keys = set(self._new_root_keys).difference(self._all_old_chks) |
213 | + # These are prefixes that are present in new_keys that we are |
214 | + # thinking to yield |
215 | + new_prefixes = set() |
216 | + # We are about to yield all of these, so we don't want them getting |
217 | + # added a second time |
218 | + processed_new_refs = self._processed_new_refs |
219 | + processed_new_refs.update(new_keys) |
220 | + for record, node, prefix_refs, items in \ |
221 | + self._read_nodes_from_store(new_keys): |
222 | + # At this level, we now know all the uninteresting references |
223 | + # So we filter and queue up whatever is remaining |
224 | + prefix_refs = [p_r for p_r in prefix_refs |
225 | + if p_r[1] not in self._all_old_chks |
226 | + and p_r[1] not in processed_new_refs] |
227 | + refs = [p_r[1] for p_r in prefix_refs] |
228 | + new_prefixes.update([p_r[0] for p_r in prefix_refs]) |
229 | + self._new_queue.extend(refs) |
230 | + # TODO: We can potentially get multiple items here, however the |
231 | + # current design allows for this, as callers will do the work |
232 | + # to make the results unique. We might profile whether we |
233 | + # gain anything by ensuring unique return values for items |
234 | + new_items = [item for item in items |
235 | + if item not in self._all_old_items] |
236 | + self._new_item_queue.extend(new_items) |
237 | + new_prefixes.update([self._search_key_func(item[0]) |
238 | + for item in new_items]) |
239 | + processed_new_refs.update(refs) |
240 | + yield record |
241 | + # For new_prefixes we have the full length prefixes queued up. |
242 | + # However, we also need possible prefixes. (If we have a known ref to |
243 | + # 'ab', then we also need to include 'a'.) So expand the |
244 | + # new_prefixes to include all shorter prefixes |
245 | + for prefix in list(new_prefixes): |
246 | + new_prefixes.update([prefix[:i] for i in xrange(1, len(prefix))]) |
247 | + self._enqueue_old(new_prefixes, old_chks_to_enqueue) |
248 | + |
249 | + def _flush_new_queue(self): |
250 | + # No need to maintain the heap invariant anymore, just pull things out |
251 | + # and process them |
252 | + refs = set(self._new_queue) |
253 | + self._new_queue = [] |
254 | + # First pass, flush all interesting items and convert to using direct refs |
255 | + all_old_chks = self._all_old_chks |
256 | + processed_new_refs = self._processed_new_refs |
257 | + all_old_items = self._all_old_items |
258 | + new_items = [item for item in self._new_item_queue |
259 | + if item not in all_old_items] |
260 | + self._new_item_queue = [] |
261 | + if new_items: |
262 | + yield None, new_items |
263 | + refs = refs.difference(all_old_chks) |
264 | + while refs: |
265 | + next_refs = set() |
266 | + next_refs_update = next_refs.update |
267 | + # Inlining _read_nodes_from_store improves 'bzr branch bzr.dev' |
268 | + # from 1m54s to 1m51s. Consider it. |
269 | + for record, _, p_refs, items in self._read_nodes_from_store(refs): |
270 | + items = [item for item in items |
271 | + if item not in all_old_items] |
272 | + yield record, items |
273 | + next_refs_update([p_r[1] for p_r in p_refs]) |
274 | + next_refs = next_refs.difference(all_old_chks) |
275 | + next_refs = next_refs.difference(processed_new_refs) |
276 | + processed_new_refs.update(next_refs) |
277 | + refs = next_refs |
278 | + |
279 | + def _process_next_old(self): |
280 | + # Since we don't filter uninteresting any further than during |
281 | + # _read_all_roots, process the whole queue in a single pass. |
282 | + refs = self._old_queue |
283 | + self._old_queue = [] |
284 | + all_old_chks = self._all_old_chks |
285 | + for record, _, prefix_refs, items in self._read_nodes_from_store(refs): |
286 | + self._all_old_items.update(items) |
287 | + refs = [r for _,r in prefix_refs if r not in all_old_chks] |
288 | + self._old_queue.extend(refs) |
289 | + all_old_chks.update(refs) |
290 | + |
291 | + def _process_queues(self): |
292 | + while self._old_queue: |
293 | + self._process_next_old() |
294 | + return self._flush_new_queue() |
295 | + |
296 | + def process(self): |
297 | + for record in self._read_all_roots(): |
298 | + yield record, [] |
299 | + for record, items in self._process_queues(): |
300 | + yield record, items |
301 | |
302 | |
303 | def iter_interesting_nodes(store, interesting_root_keys, |
304 | @@ -1512,72 +1608,11 @@ |
305 | :return: Yield |
306 | (interesting record, {interesting key:values}) |
307 | """ |
308 | - # TODO: consider that it may be more memory efficient to use the 20-byte |
309 | - # sha1 string, rather than tuples of hexidecimal sha1 strings. |
310 | - # TODO: Try to factor out a lot of the get_record_stream() calls into a |
311 | - # helper function similar to _read_bytes. This function should be |
312 | - # able to use nodes from the _page_cache as well as actually |
313 | - # requesting bytes from the store. |
314 | - |
315 | - (all_uninteresting_chks, all_uninteresting_items, interesting_keys, |
316 | - interesting_to_yield, interesting_items) = _find_all_uninteresting(store, |
317 | - interesting_root_keys, uninteresting_root_keys, pb) |
318 | - |
319 | - # Now that we know everything uninteresting, we can yield information from |
320 | - # our first request |
321 | - interesting_items.difference_update(all_uninteresting_items) |
322 | - interesting_to_yield = set(interesting_to_yield) - all_uninteresting_chks |
323 | - if interesting_items: |
324 | - yield None, interesting_items |
325 | - if interesting_to_yield: |
326 | - # We request these records again, rather than buffering the root |
327 | - # records, most likely they are still in the _group_cache anyway. |
328 | - for record in store.get_record_stream(interesting_to_yield, |
329 | - 'unordered', False): |
330 | - yield record, [] |
331 | - all_uninteresting_chks.update(interesting_to_yield) |
332 | - interesting_keys.difference_update(all_uninteresting_chks) |
333 | - |
334 | - chks_to_read = interesting_keys |
335 | - counter = 0 |
336 | - while chks_to_read: |
337 | - next_chks = set() |
338 | - for record in store.get_record_stream(chks_to_read, 'unordered', False): |
339 | - counter += 1 |
340 | - if pb is not None: |
341 | - pb.update('find chk pages', counter) |
342 | - # TODO: Handle 'absent'? |
343 | - bytes = record.get_bytes_as('fulltext') |
344 | - # We don't care about search_key_func for this code, because we |
345 | - # only care about external references. |
346 | - node = _deserialise(bytes, record.key, search_key_func=None) |
347 | - if type(node) is InternalNode: |
348 | - # all_uninteresting_chks grows large, as it lists all nodes we |
349 | - # don't want to process (including already seen interesting |
350 | - # nodes). |
351 | - # small.difference_update(large) scales O(large), but |
352 | - # small.difference(large) scales O(small). |
353 | - # Also, we know we just _deserialised this node, so we can |
354 | - # access the dict directly. |
355 | - chks = set(node._items.itervalues()).difference( |
356 | - all_uninteresting_chks) |
357 | - # Is set() and .difference_update better than: |
358 | - # chks = [chk for chk in node.refs() |
359 | - # if chk not in all_uninteresting_chks] |
360 | - next_chks.update(chks) |
361 | - # These are now uninteresting everywhere else |
362 | - all_uninteresting_chks.update(chks) |
363 | - interesting_items = [] |
364 | - else: |
365 | - interesting_items = [item for item in node._items.iteritems() |
366 | - if item not in all_uninteresting_items] |
367 | - # TODO: Do we need to filter out items that we have already |
368 | - # seen on other pages? We don't really want to buffer the |
369 | - # whole thing, but it does mean that callers need to |
370 | - # understand they may get duplicate values. |
371 | - # all_uninteresting_items.update(interesting_items) |
372 | - yield record, interesting_items |
373 | - chks_to_read = next_chks |
374 | + iterator = CHKMapDifference(store, interesting_root_keys, |
375 | + uninteresting_root_keys, |
376 | + search_key_func=store._search_key_func, |
377 | + pb=pb) |
378 | + return iterator.process() |
379 | |
380 | |
381 | try: |
382 | |
383 | === modified file 'bzrlib/repofmt/groupcompress_repo.py' |
384 | --- bzrlib/repofmt/groupcompress_repo.py 2009-07-06 14:47:36 +0000 |
385 | +++ bzrlib/repofmt/groupcompress_repo.py 2009-07-09 07:36:30 +0000 |
386 | @@ -650,6 +650,9 @@ |
387 | parents=False, is_locked=self.is_locked, |
388 | inconsistency_fatal=False), |
389 | access=self._pack_collection.chk_index.data_access) |
390 | + search_key_name = self._format._serializer.search_key_name |
391 | + search_key_func = chk_map.search_key_registry.get(search_key_name) |
392 | + self.chk_bytes._search_key_func = search_key_func |
393 | # True when the repository object is 'write locked' (as opposed to the |
394 | # physical lock only taken out around changes to the pack-names list.) |
395 | # Another way to represent this would be a decorator around the control |
396 | |
397 | === modified file 'bzrlib/tests/test_chk_map.py' |
398 | --- bzrlib/tests/test_chk_map.py 2009-06-26 09:24:34 +0000 |
399 | +++ bzrlib/tests/test_chk_map.py 2009-07-09 07:36:31 +0000 |
400 | @@ -20,6 +20,7 @@ |
401 | |
402 | from bzrlib import ( |
403 | chk_map, |
404 | + groupcompress, |
405 | osutils, |
406 | tests, |
407 | ) |
408 | @@ -59,17 +60,14 @@ |
409 | self.assertCommonPrefix('', '', '') |
410 | |
411 | |
412 | -class TestCaseWithStore(tests.TestCaseWithTransport): |
413 | +class TestCaseWithStore(tests.TestCaseWithMemoryTransport): |
414 | |
415 | def get_chk_bytes(self): |
416 | # The easiest way to get a CHK store is a development6 repository and |
417 | # then work with the chk_bytes attribute directly. |
418 | - repo = self.make_repository(".", format="development6-rich-root") |
419 | - repo.lock_write() |
420 | - self.addCleanup(repo.unlock) |
421 | - repo.start_write_group() |
422 | - self.addCleanup(repo.abort_write_group) |
423 | - return repo.chk_bytes |
424 | + factory = groupcompress.make_pack_factory(False, False, 1) |
425 | + self.chk_bytes = factory(self.get_transport()) |
426 | + return self.chk_bytes |
427 | |
428 | def _get_map(self, a_dict, maximum_size=0, chk_bytes=None, key_width=1, |
429 | search_key_func=None): |
430 | @@ -97,6 +95,246 @@ |
431 | return dict(node.iteritems(*args)) |
432 | |
433 | |
434 | +class TestCaseWithExampleMaps(TestCaseWithStore): |
435 | + |
436 | + def get_chk_bytes(self): |
437 | + if getattr(self, '_chk_bytes', None) is None: |
438 | + self._chk_bytes = super(TestCaseWithExampleMaps, |
439 | + self).get_chk_bytes() |
440 | + return self._chk_bytes |
441 | + |
442 | + def get_map(self, a_dict, maximum_size=100, search_key_func=None): |
443 | + c_map = self._get_map(a_dict, maximum_size=maximum_size, |
444 | + chk_bytes=self.get_chk_bytes(), |
445 | + search_key_func=search_key_func) |
446 | + return c_map |
447 | + |
448 | + def make_root_only_map(self, search_key_func=None): |
449 | + return self.get_map({ |
450 | + ('aaa',): 'initial aaa content', |
451 | + ('abb',): 'initial abb content', |
452 | + }, search_key_func=search_key_func) |
453 | + |
454 | + def make_root_only_aaa_ddd_map(self, search_key_func=None): |
455 | + return self.get_map({ |
456 | + ('aaa',): 'initial aaa content', |
457 | + ('ddd',): 'initial ddd content', |
458 | + }, search_key_func=search_key_func) |
459 | + |
460 | + def make_one_deep_map(self, search_key_func=None): |
461 | + # Same as root_only_map, except it forces an InternalNode at the root |
462 | + return self.get_map({ |
463 | + ('aaa',): 'initial aaa content', |
464 | + ('abb',): 'initial abb content', |
465 | + ('ccc',): 'initial ccc content', |
466 | + ('ddd',): 'initial ddd content', |
467 | + }, search_key_func=search_key_func) |
468 | + |
469 | + def make_two_deep_map(self, search_key_func=None): |
470 | + # Carefully chosen so that it creates a 2-deep map for both |
471 | + # _search_key_plain and for _search_key_16 |
472 | + # Also so that things line up with make_one_deep_two_prefix_map |
473 | + return self.get_map({ |
474 | + ('aaa',): 'initial aaa content', |
475 | + ('abb',): 'initial abb content', |
476 | + ('acc',): 'initial acc content', |
477 | + ('ace',): 'initial ace content', |
478 | + ('add',): 'initial add content', |
479 | + ('adh',): 'initial adh content', |
480 | + ('adl',): 'initial adl content', |
481 | + ('ccc',): 'initial ccc content', |
482 | + ('ddd',): 'initial ddd content', |
483 | + }, search_key_func=search_key_func) |
484 | + |
485 | + def make_one_deep_two_prefix_map(self, search_key_func=None): |
486 | + """Create a map with one internal node, but references are extra long. |
487 | + |
488 | + Otherwise has similar content to make_two_deep_map. |
489 | + """ |
490 | + return self.get_map({ |
491 | + ('aaa',): 'initial aaa content', |
492 | + ('add',): 'initial add content', |
493 | + ('adh',): 'initial adh content', |
494 | + ('adl',): 'initial adl content', |
495 | + }, search_key_func=search_key_func) |
496 | + |
497 | + def make_one_deep_one_prefix_map(self, search_key_func=None): |
498 | + """Create a map with one internal node, but references are extra long. |
499 | + |
500 | + Similar to make_one_deep_two_prefix_map, except the split is at the |
501 | + first char, rather than the second. |
502 | + """ |
503 | + return self.get_map({ |
504 | + ('add',): 'initial add content', |
505 | + ('adh',): 'initial adh content', |
506 | + ('adl',): 'initial adl content', |
507 | + ('bbb',): 'initial bbb content', |
508 | + }, search_key_func=search_key_func) |
509 | + |
510 | + |
511 | +class TestTestCaseWithExampleMaps(TestCaseWithExampleMaps): |
512 | + """Actual tests for the provided examples.""" |
513 | + |
514 | + def test_root_only_map_plain(self): |
515 | + c_map = self.make_root_only_map() |
516 | + self.assertEqualDiff( |
517 | + "'' LeafNode\n" |
518 | + " ('aaa',) 'initial aaa content'\n" |
519 | + " ('abb',) 'initial abb content'\n", |
520 | + c_map._dump_tree()) |
521 | + |
522 | + def test_root_only_map_16(self): |
523 | + c_map = self.make_root_only_map(search_key_func=chk_map._search_key_16) |
524 | + self.assertEqualDiff( |
525 | + "'' LeafNode\n" |
526 | + " ('aaa',) 'initial aaa content'\n" |
527 | + " ('abb',) 'initial abb content'\n", |
528 | + c_map._dump_tree()) |
529 | + |
530 | + def test_one_deep_map_plain(self): |
531 | + c_map = self.make_one_deep_map() |
532 | + self.assertEqualDiff( |
533 | + "'' InternalNode\n" |
534 | + " 'a' LeafNode\n" |
535 | + " ('aaa',) 'initial aaa content'\n" |
536 | + " ('abb',) 'initial abb content'\n" |
537 | + " 'c' LeafNode\n" |
538 | + " ('ccc',) 'initial ccc content'\n" |
539 | + " 'd' LeafNode\n" |
540 | + " ('ddd',) 'initial ddd content'\n", |
541 | + c_map._dump_tree()) |
542 | + |
543 | + def test_one_deep_map_16(self): |
544 | + c_map = self.make_one_deep_map(search_key_func=chk_map._search_key_16) |
545 | + self.assertEqualDiff( |
546 | + "'' InternalNode\n" |
547 | + " '2' LeafNode\n" |
548 | + " ('ccc',) 'initial ccc content'\n" |
549 | + " '4' LeafNode\n" |
550 | + " ('abb',) 'initial abb content'\n" |
551 | + " 'F' LeafNode\n" |
552 | + " ('aaa',) 'initial aaa content'\n" |
553 | + " ('ddd',) 'initial ddd content'\n", |
554 | + c_map._dump_tree()) |
555 | + |
556 | + def test_root_only_aaa_ddd_plain(self): |
557 | + c_map = self.make_root_only_aaa_ddd_map() |
558 | + self.assertEqualDiff( |
559 | + "'' LeafNode\n" |
560 | + " ('aaa',) 'initial aaa content'\n" |
561 | + " ('ddd',) 'initial ddd content'\n", |
562 | + c_map._dump_tree()) |
563 | + |
564 | + def test_root_only_aaa_ddd_16(self): |
565 | + c_map = self.make_root_only_aaa_ddd_map( |
566 | + search_key_func=chk_map._search_key_16) |
567 | + # We use 'aaa' and 'ddd' because they happen to map to 'F' when using |
568 | + # _search_key_16 |
569 | + self.assertEqualDiff( |
570 | + "'' LeafNode\n" |
571 | + " ('aaa',) 'initial aaa content'\n" |
572 | + " ('ddd',) 'initial ddd content'\n", |
573 | + c_map._dump_tree()) |
574 | + |
575 | + def test_two_deep_map_plain(self): |
576 | + c_map = self.make_two_deep_map() |
577 | + self.assertEqualDiff( |
578 | + "'' InternalNode\n" |
579 | + " 'a' InternalNode\n" |
580 | + " 'aa' LeafNode\n" |
581 | + " ('aaa',) 'initial aaa content'\n" |
582 | + " 'ab' LeafNode\n" |
583 | + " ('abb',) 'initial abb content'\n" |
584 | + " 'ac' LeafNode\n" |
585 | + " ('acc',) 'initial acc content'\n" |
586 | + " ('ace',) 'initial ace content'\n" |
587 | + " 'ad' LeafNode\n" |
588 | + " ('add',) 'initial add content'\n" |
589 | + " ('adh',) 'initial adh content'\n" |
590 | + " ('adl',) 'initial adl content'\n" |
591 | + " 'c' LeafNode\n" |
592 | + " ('ccc',) 'initial ccc content'\n" |
593 | + " 'd' LeafNode\n" |
594 | + " ('ddd',) 'initial ddd content'\n", |
595 | + c_map._dump_tree()) |
596 | + |
597 | + def test_two_deep_map_16(self): |
598 | + c_map = self.make_two_deep_map(search_key_func=chk_map._search_key_16) |
599 | + self.assertEqualDiff( |
600 | + "'' InternalNode\n" |
601 | + " '2' LeafNode\n" |
602 | + " ('acc',) 'initial acc content'\n" |
603 | + " ('ccc',) 'initial ccc content'\n" |
604 | + " '4' LeafNode\n" |
605 | + " ('abb',) 'initial abb content'\n" |
606 | + " 'C' LeafNode\n" |
607 | + " ('ace',) 'initial ace content'\n" |
608 | + " 'F' InternalNode\n" |
609 | + " 'F0' LeafNode\n" |
610 | + " ('aaa',) 'initial aaa content'\n" |
611 | + " 'F3' LeafNode\n" |
612 | + " ('adl',) 'initial adl content'\n" |
613 | + " 'F4' LeafNode\n" |
614 | + " ('adh',) 'initial adh content'\n" |
615 | + " 'FB' LeafNode\n" |
616 | + " ('ddd',) 'initial ddd content'\n" |
617 | + " 'FD' LeafNode\n" |
618 | + " ('add',) 'initial add content'\n", |
619 | + c_map._dump_tree()) |
620 | + |
621 | + def test_one_deep_two_prefix_map_plain(self): |
622 | + c_map = self.make_one_deep_two_prefix_map() |
623 | + self.assertEqualDiff( |
624 | + "'' InternalNode\n" |
625 | + " 'aa' LeafNode\n" |
626 | + " ('aaa',) 'initial aaa content'\n" |
627 | + " 'ad' LeafNode\n" |
628 | + " ('add',) 'initial add content'\n" |
629 | + " ('adh',) 'initial adh content'\n" |
630 | + " ('adl',) 'initial adl content'\n", |
631 | + c_map._dump_tree()) |
632 | + |
633 | + def test_one_deep_two_prefix_map_16(self): |
634 | + c_map = self.make_one_deep_two_prefix_map( |
635 | + search_key_func=chk_map._search_key_16) |
636 | + self.assertEqualDiff( |
637 | + "'' InternalNode\n" |
638 | + " 'F0' LeafNode\n" |
639 | + " ('aaa',) 'initial aaa content'\n" |
640 | + " 'F3' LeafNode\n" |
641 | + " ('adl',) 'initial adl content'\n" |
642 | + " 'F4' LeafNode\n" |
643 | + " ('adh',) 'initial adh content'\n" |
644 | + " 'FD' LeafNode\n" |
645 | + " ('add',) 'initial add content'\n", |
646 | + c_map._dump_tree()) |
647 | + |
648 | + def test_one_deep_one_prefix_map_plain(self): |
649 | + c_map = self.make_one_deep_one_prefix_map() |
650 | + self.assertEqualDiff( |
651 | + "'' InternalNode\n" |
652 | + " 'a' LeafNode\n" |
653 | + " ('add',) 'initial add content'\n" |
654 | + " ('adh',) 'initial adh content'\n" |
655 | + " ('adl',) 'initial adl content'\n" |
656 | + " 'b' LeafNode\n" |
657 | + " ('bbb',) 'initial bbb content'\n", |
658 | + c_map._dump_tree()) |
659 | + |
660 | + def test_one_deep_one_prefix_map_16(self): |
661 | + c_map = self.make_one_deep_one_prefix_map( |
662 | + search_key_func=chk_map._search_key_16) |
663 | + self.assertEqualDiff( |
664 | + "'' InternalNode\n" |
665 | + " '4' LeafNode\n" |
666 | + " ('bbb',) 'initial bbb content'\n" |
667 | + " 'F' LeafNode\n" |
668 | + " ('add',) 'initial add content'\n" |
669 | + " ('adh',) 'initial adh content'\n" |
670 | + " ('adl',) 'initial adl content'\n", |
671 | + c_map._dump_tree()) |
672 | + |
673 | + |
674 | class TestMap(TestCaseWithStore): |
675 | |
676 | def assertHasABMap(self, chk_bytes): |
677 | @@ -1886,62 +2124,406 @@ |
678 | # 1-4K get0 |
679 | |
680 | |
681 | -class TestIterInterestingNodes(TestCaseWithStore): |
682 | - |
683 | - def get_chk_bytes(self): |
684 | - if getattr(self, '_chk_bytes', None) is None: |
685 | - self._chk_bytes = super(TestIterInterestingNodes, |
686 | - self).get_chk_bytes() |
687 | - return self._chk_bytes |
688 | +class TestCHKMapDifference(TestCaseWithExampleMaps): |
689 | + |
690 | + def get_difference(self, new_roots, old_roots, |
691 | + search_key_func=None): |
692 | + if search_key_func is None: |
693 | + search_key_func = chk_map._search_key_plain |
694 | + return chk_map.CHKMapDifference(self.get_chk_bytes(), |
695 | + new_roots, old_roots, search_key_func) |
696 | + |
697 | + def test__init__(self): |
698 | + c_map = self.make_root_only_map() |
699 | + key1 = c_map.key() |
700 | + c_map.map(('aaa',), 'new aaa content') |
701 | + key2 = c_map._save() |
702 | + diff = self.get_difference([key2], [key1]) |
703 | + self.assertEqual(set([key1]), diff._all_old_chks) |
704 | + self.assertEqual([], diff._old_queue) |
705 | + self.assertEqual([], diff._new_queue) |
706 | + |
707 | + def help__read_all_roots(self, search_key_func): |
708 | + c_map = self.make_root_only_map(search_key_func=search_key_func) |
709 | + key1 = c_map.key() |
710 | + c_map.map(('aaa',), 'new aaa content') |
711 | + key2 = c_map._save() |
712 | + diff = self.get_difference([key2], [key1], search_key_func) |
713 | + root_results = [record.key for record in diff._read_all_roots()] |
714 | + self.assertEqual([key2], root_results) |
715 | + # We should have queued up only items that aren't in the old |
716 | + # set |
717 | + self.assertEqual([(('aaa',), 'new aaa content')], |
718 | + diff._new_item_queue) |
719 | + self.assertEqual([], diff._new_queue) |
720 | + # And there are no old references, so that queue should be |
721 | + # empty |
722 | + self.assertEqual([], diff._old_queue) |
723 | + |
724 | + def test__read_all_roots_plain(self): |
725 | + self.help__read_all_roots(search_key_func=chk_map._search_key_plain) |
726 | + |
727 | + def test__read_all_roots_16(self): |
728 | + self.help__read_all_roots(search_key_func=chk_map._search_key_16) |
729 | + |
730 | + def test__read_all_roots_skips_known_old(self): |
731 | + c_map = self.make_one_deep_map(chk_map._search_key_plain) |
732 | + key1 = c_map.key() |
733 | + c_map2 = self.make_root_only_map(chk_map._search_key_plain) |
734 | + key2 = c_map2.key() |
735 | + diff = self.get_difference([key2], [key1], chk_map._search_key_plain) |
736 | + root_results = [record.key for record in diff._read_all_roots()] |
737 | + # We should have no results. key2 is completely contained within key1, |
738 | + # and we should have seen that in the first pass |
739 | + self.assertEqual([], root_results) |
740 | + |
741 | + def test__read_all_roots_prepares_queues(self): |
742 | + c_map = self.make_one_deep_map(chk_map._search_key_plain) |
743 | + key1 = c_map.key() |
744 | + c_map._dump_tree() # load everything |
745 | + key1_a = c_map._root_node._items['a'].key() |
746 | + c_map.map(('abb',), 'new abb content') |
747 | + key2 = c_map._save() |
748 | + key2_a = c_map._root_node._items['a'].key() |
749 | + diff = self.get_difference([key2], [key1], chk_map._search_key_plain) |
750 | + root_results = [record.key for record in diff._read_all_roots()] |
751 | + self.assertEqual([key2], root_results) |
 752 | + # At this point, we should have queued up only the 'a' Leaf on both |
 753 | + # sides; 'c' and 'd' are known not to have changed on either side |
754 | + self.assertEqual([key2_a], diff._new_queue) |
755 | + self.assertEqual([], diff._new_item_queue) |
756 | + self.assertEqual([key1_a], diff._old_queue) |
757 | + |
758 | + def test__read_all_roots_multi_new_prepares_queues(self): |
759 | + c_map = self.make_one_deep_map(chk_map._search_key_plain) |
760 | + key1 = c_map.key() |
761 | + c_map._dump_tree() # load everything |
762 | + key1_a = c_map._root_node._items['a'].key() |
763 | + key1_c = c_map._root_node._items['c'].key() |
764 | + c_map.map(('abb',), 'new abb content') |
765 | + key2 = c_map._save() |
766 | + key2_a = c_map._root_node._items['a'].key() |
767 | + key2_c = c_map._root_node._items['c'].key() |
768 | + c_map = chk_map.CHKMap(self.get_chk_bytes(), key1, |
769 | + chk_map._search_key_plain) |
770 | + c_map.map(('ccc',), 'new ccc content') |
771 | + key3 = c_map._save() |
772 | + key3_a = c_map._root_node._items['a'].key() |
773 | + key3_c = c_map._root_node._items['c'].key() |
774 | + diff = self.get_difference([key2, key3], [key1], |
775 | + chk_map._search_key_plain) |
776 | + root_results = [record.key for record in diff._read_all_roots()] |
777 | + self.assertEqual(sorted([key2, key3]), sorted(root_results)) |
 778 | + # We should have queued up key2_a and key3_c, but not key2_c or key3_a |
779 | + self.assertEqual([key2_a, key3_c], diff._new_queue) |
780 | + self.assertEqual([], diff._new_item_queue) |
781 | + # And we should have queued up both a and c for the old set |
782 | + self.assertEqual([key1_a, key1_c], diff._old_queue) |
783 | + |
784 | + def test__read_all_roots_different_depths(self): |
785 | + c_map = self.make_two_deep_map(chk_map._search_key_plain) |
786 | + c_map._dump_tree() # load everything |
787 | + key1 = c_map.key() |
788 | + key1_a = c_map._root_node._items['a'].key() |
789 | + key1_c = c_map._root_node._items['c'].key() |
790 | + key1_d = c_map._root_node._items['d'].key() |
791 | + |
792 | + c_map2 = self.make_one_deep_two_prefix_map(chk_map._search_key_plain) |
793 | + c_map2._dump_tree() |
794 | + key2 = c_map2.key() |
795 | + key2_aa = c_map2._root_node._items['aa'].key() |
796 | + key2_ad = c_map2._root_node._items['ad'].key() |
797 | + |
798 | + diff = self.get_difference([key2], [key1], chk_map._search_key_plain) |
799 | + root_results = [record.key for record in diff._read_all_roots()] |
800 | + self.assertEqual([key2], root_results) |
801 | + # Only the 'a' subset should be queued up, since 'c' and 'd' cannot be |
802 | + # present |
803 | + self.assertEqual([key1_a], diff._old_queue) |
804 | + self.assertEqual([key2_aa, key2_ad], diff._new_queue) |
805 | + self.assertEqual([], diff._new_item_queue) |
806 | + |
807 | + diff = self.get_difference([key1], [key2], chk_map._search_key_plain) |
808 | + root_results = [record.key for record in diff._read_all_roots()] |
809 | + self.assertEqual([key1], root_results) |
810 | + |
811 | + self.assertEqual([key2_aa, key2_ad], diff._old_queue) |
812 | + self.assertEqual([key1_a, key1_c, key1_d], diff._new_queue) |
813 | + self.assertEqual([], diff._new_item_queue) |
814 | + |
815 | + def test__read_all_roots_different_depths_16(self): |
816 | + c_map = self.make_two_deep_map(chk_map._search_key_16) |
817 | + c_map._dump_tree() # load everything |
818 | + key1 = c_map.key() |
819 | + key1_2 = c_map._root_node._items['2'].key() |
820 | + key1_4 = c_map._root_node._items['4'].key() |
821 | + key1_C = c_map._root_node._items['C'].key() |
822 | + key1_F = c_map._root_node._items['F'].key() |
823 | + |
824 | + c_map2 = self.make_one_deep_two_prefix_map(chk_map._search_key_16) |
825 | + c_map2._dump_tree() |
826 | + key2 = c_map2.key() |
827 | + key2_F0 = c_map2._root_node._items['F0'].key() |
828 | + key2_F3 = c_map2._root_node._items['F3'].key() |
829 | + key2_F4 = c_map2._root_node._items['F4'].key() |
830 | + key2_FD = c_map2._root_node._items['FD'].key() |
831 | + |
832 | + diff = self.get_difference([key2], [key1], chk_map._search_key_16) |
833 | + root_results = [record.key for record in diff._read_all_roots()] |
834 | + self.assertEqual([key2], root_results) |
835 | + # Only the subset of keys that may be present should be queued up. |
836 | + self.assertEqual([key1_F], diff._old_queue) |
837 | + self.assertEqual(sorted([key2_F0, key2_F3, key2_F4, key2_FD]), |
838 | + sorted(diff._new_queue)) |
839 | + self.assertEqual([], diff._new_item_queue) |
840 | + |
841 | + diff = self.get_difference([key1], [key2], chk_map._search_key_16) |
842 | + root_results = [record.key for record in diff._read_all_roots()] |
843 | + self.assertEqual([key1], root_results) |
844 | + |
845 | + self.assertEqual(sorted([key2_F0, key2_F3, key2_F4, key2_FD]), |
846 | + sorted(diff._old_queue)) |
847 | + self.assertEqual(sorted([key1_2, key1_4, key1_C, key1_F]), |
848 | + sorted(diff._new_queue)) |
849 | + self.assertEqual([], diff._new_item_queue) |
850 | + |
851 | + def test__read_all_roots_mixed_depth(self): |
852 | + c_map = self.make_one_deep_two_prefix_map(chk_map._search_key_plain) |
853 | + c_map._dump_tree() # load everything |
854 | + key1 = c_map.key() |
855 | + key1_aa = c_map._root_node._items['aa'].key() |
856 | + key1_ad = c_map._root_node._items['ad'].key() |
857 | + |
858 | + c_map2 = self.make_one_deep_one_prefix_map(chk_map._search_key_plain) |
859 | + c_map2._dump_tree() |
860 | + key2 = c_map2.key() |
861 | + key2_a = c_map2._root_node._items['a'].key() |
862 | + key2_b = c_map2._root_node._items['b'].key() |
863 | + |
864 | + diff = self.get_difference([key2], [key1], chk_map._search_key_plain) |
865 | + root_results = [record.key for record in diff._read_all_roots()] |
866 | + self.assertEqual([key2], root_results) |
867 | + # 'ad' matches exactly 'a' on the other side, so it should be removed, |
868 | + # and neither side should have it queued for walking |
869 | + self.assertEqual([], diff._old_queue) |
870 | + self.assertEqual([key2_b], diff._new_queue) |
871 | + self.assertEqual([], diff._new_item_queue) |
872 | + |
873 | + diff = self.get_difference([key1], [key2], chk_map._search_key_plain) |
874 | + root_results = [record.key for record in diff._read_all_roots()] |
875 | + self.assertEqual([key1], root_results) |
 876 | + # Note: This is technically not the 'true minimal' set that we could |
 877 | + # use. The reason is that 'a' was matched exactly to 'ad' (by sha |
 878 | + # sum). However, the code gets complicated in the case of more |
 879 | + # than one interesting key, so for now we live with this. |
 880 | + # Consider revising, though benchmarking showing it to be a |
 881 | + # real-world issue should be done first. |
882 | + self.assertEqual([key2_a], diff._old_queue) |
883 | + # self.assertEqual([], diff._old_queue) |
884 | + self.assertEqual([key1_aa], diff._new_queue) |
885 | + self.assertEqual([], diff._new_item_queue) |
886 | + |
887 | + def test__read_all_roots_yields_extra_deep_records(self): |
888 | + # This is slightly controversial, as we will yield a chk page that we |
889 | + # might later on find out could be filtered out. (If a root node is |
890 | + # referenced deeper in the old set.) |
891 | + # However, even with stacking, we always have all chk pages that we |
892 | + # will need. So as long as we filter out the referenced keys, we'll |
893 | + # never run into problems. |
894 | + # This allows us to yield a root node record immediately, without any |
895 | + # buffering. |
896 | + c_map = self.make_two_deep_map(chk_map._search_key_plain) |
897 | + c_map._dump_tree() # load all keys |
898 | + key1 = c_map.key() |
899 | + key1_a = c_map._root_node._items['a'].key() |
900 | + c_map2 = self.get_map({ |
901 | + ('acc',): 'initial acc content', |
902 | + ('ace',): 'initial ace content', |
903 | + }, maximum_size=100) |
904 | + self.assertEqualDiff( |
905 | + "'' LeafNode\n" |
906 | + " ('acc',) 'initial acc content'\n" |
907 | + " ('ace',) 'initial ace content'\n", |
908 | + c_map2._dump_tree()) |
909 | + key2 = c_map2.key() |
910 | + diff = self.get_difference([key2], [key1], chk_map._search_key_plain) |
911 | + root_results = [record.key for record in diff._read_all_roots()] |
912 | + self.assertEqual([key2], root_results) |
913 | + # However, even though we have yielded the root node to be fetched, |
 914 | + # we should have enqueued all of the chk pages to be walked, so that we |
915 | + # can find the keys if they are present |
916 | + self.assertEqual([key1_a], diff._old_queue) |
917 | + self.assertEqual([(('acc',), 'initial acc content'), |
918 | + (('ace',), 'initial ace content'), |
919 | + ], diff._new_item_queue) |
920 | + |
921 | + def test__read_all_roots_multiple_targets(self): |
922 | + c_map = self.make_root_only_map() |
923 | + key1 = c_map.key() |
924 | + c_map = self.make_one_deep_map() |
925 | + key2 = c_map.key() |
926 | + c_map._dump_tree() |
927 | + key2_c = c_map._root_node._items['c'].key() |
928 | + key2_d = c_map._root_node._items['d'].key() |
929 | + c_map.map(('ccc',), 'new ccc value') |
930 | + key3 = c_map._save() |
931 | + key3_c = c_map._root_node._items['c'].key() |
932 | + diff = self.get_difference([key2, key3], [key1], |
933 | + chk_map._search_key_plain) |
934 | + root_results = [record.key for record in diff._read_all_roots()] |
935 | + self.assertEqual(sorted([key2, key3]), sorted(root_results)) |
936 | + self.assertEqual([], diff._old_queue) |
937 | + # the key 'd' is interesting from key2 and key3, but should only be |
938 | + # entered into the queue 1 time |
939 | + self.assertEqual(sorted([key2_c, key3_c, key2_d]), |
940 | + sorted(diff._new_queue)) |
941 | + self.assertEqual([], diff._new_item_queue) |
942 | + |
943 | + def test__read_all_roots_no_old(self): |
 944 | + # This is the 'initial branch' case. With nothing in the old |
 945 | + # set, we can just queue up all root nodes into the new queue, and |
946 | + # then have them fast-path flushed via _flush_new_queue |
947 | + c_map = self.make_two_deep_map() |
948 | + key1 = c_map.key() |
949 | + diff = self.get_difference([key1], [], chk_map._search_key_plain) |
950 | + root_results = [record.key for record in diff._read_all_roots()] |
951 | + self.assertEqual([], root_results) |
952 | + self.assertEqual([], diff._old_queue) |
953 | + self.assertEqual([key1], diff._new_queue) |
954 | + self.assertEqual([], diff._new_item_queue) |
955 | + |
956 | + c_map2 = self.make_one_deep_map() |
957 | + key2 = c_map2.key() |
958 | + diff = self.get_difference([key1, key2], [], chk_map._search_key_plain) |
959 | + root_results = [record.key for record in diff._read_all_roots()] |
960 | + self.assertEqual([], root_results) |
961 | + self.assertEqual([], diff._old_queue) |
962 | + self.assertEqual(sorted([key1, key2]), sorted(diff._new_queue)) |
963 | + self.assertEqual([], diff._new_item_queue) |
964 | + |
965 | + def test__read_all_roots_no_old_16(self): |
966 | + c_map = self.make_two_deep_map(chk_map._search_key_16) |
967 | + key1 = c_map.key() |
968 | + diff = self.get_difference([key1], [], chk_map._search_key_16) |
969 | + root_results = [record.key for record in diff._read_all_roots()] |
970 | + self.assertEqual([], root_results) |
971 | + self.assertEqual([], diff._old_queue) |
972 | + self.assertEqual([key1], diff._new_queue) |
973 | + self.assertEqual([], diff._new_item_queue) |
974 | + |
975 | + c_map2 = self.make_one_deep_map(chk_map._search_key_16) |
976 | + key2 = c_map2.key() |
977 | + diff = self.get_difference([key1, key2], [], |
978 | + chk_map._search_key_16) |
979 | + root_results = [record.key for record in diff._read_all_roots()] |
980 | + self.assertEqual([], root_results) |
981 | + self.assertEqual([], diff._old_queue) |
982 | + self.assertEqual(sorted([key1, key2]), |
983 | + sorted(diff._new_queue)) |
984 | + self.assertEqual([], diff._new_item_queue) |
985 | + |
986 | + def test__read_all_roots_multiple_old(self): |
987 | + c_map = self.make_two_deep_map() |
988 | + key1 = c_map.key() |
989 | + c_map._dump_tree() # load everything |
990 | + key1_a = c_map._root_node._items['a'].key() |
991 | + c_map.map(('ccc',), 'new ccc value') |
992 | + key2 = c_map._save() |
993 | + key2_a = c_map._root_node._items['a'].key() |
994 | + c_map.map(('add',), 'new add value') |
995 | + key3 = c_map._save() |
996 | + key3_a = c_map._root_node._items['a'].key() |
997 | + diff = self.get_difference([key3], [key1, key2], |
998 | + chk_map._search_key_plain) |
999 | + root_results = [record.key for record in diff._read_all_roots()] |
1000 | + self.assertEqual([key3], root_results) |
1001 | + # the 'a' keys should not be queued up 2 times, since they are |
1002 | + # identical |
1003 | + self.assertEqual([key1_a], diff._old_queue) |
1004 | + self.assertEqual([key3_a], diff._new_queue) |
1005 | + self.assertEqual([], diff._new_item_queue) |
1006 | + |
1007 | + def test__process_next_old_batched_no_dupes(self): |
1008 | + c_map = self.make_two_deep_map() |
1009 | + key1 = c_map.key() |
1010 | + c_map._dump_tree() # load everything |
1011 | + key1_a = c_map._root_node._items['a'].key() |
1012 | + key1_aa = c_map._root_node._items['a']._items['aa'].key() |
1013 | + key1_ab = c_map._root_node._items['a']._items['ab'].key() |
1014 | + key1_ac = c_map._root_node._items['a']._items['ac'].key() |
1015 | + key1_ad = c_map._root_node._items['a']._items['ad'].key() |
1016 | + c_map.map(('aaa',), 'new aaa value') |
1017 | + key2 = c_map._save() |
1018 | + key2_a = c_map._root_node._items['a'].key() |
1019 | + key2_aa = c_map._root_node._items['a']._items['aa'].key() |
1020 | + c_map.map(('acc',), 'new acc content') |
1021 | + key3 = c_map._save() |
1022 | + key3_a = c_map._root_node._items['a'].key() |
1023 | + key3_ac = c_map._root_node._items['a']._items['ac'].key() |
1024 | + diff = self.get_difference([key3], [key1, key2], |
1025 | + chk_map._search_key_plain) |
1026 | + root_results = [record.key for record in diff._read_all_roots()] |
1027 | + self.assertEqual([key3], root_results) |
1028 | + self.assertEqual(sorted([key1_a, key2_a]), |
1029 | + sorted(diff._old_queue)) |
1030 | + self.assertEqual([key3_a], diff._new_queue) |
1031 | + self.assertEqual([], diff._new_item_queue) |
1032 | + diff._process_next_old() |
1033 | + # All of the old records should be brought in and queued up, |
1034 | + # but we should not have any duplicates |
1035 | + self.assertEqual(sorted([key1_aa, key1_ab, key1_ac, key1_ad, key2_aa]), |
1036 | + sorted(diff._old_queue)) |
1037 | + |
1038 | + |
1039 | +class TestIterInterestingNodes(TestCaseWithExampleMaps): |
1040 | |
1041 | def get_map_key(self, a_dict, maximum_size=10): |
1042 | - c_map = self._get_map(a_dict, maximum_size=maximum_size, |
1043 | - chk_bytes=self.get_chk_bytes()) |
1044 | + c_map = self.get_map(a_dict, maximum_size=maximum_size) |
1045 | return c_map.key() |
1046 | |
1047 | - def assertIterInteresting(self, expected, interesting_keys, |
1048 | - uninteresting_keys): |
1049 | + def assertIterInteresting(self, records, items, interesting_keys, |
1050 | + old_keys): |
1051 | """Check the result of iter_interesting_nodes. |
1052 | |
1053 | - :param expected: A list of (record_keys, interesting_chk_pages, |
1054 | - interesting key value pairs) |
1055 | + Note that we no longer care how many steps are taken, etc, just that |
1056 | + the right contents are returned. |
1057 | + |
1058 | + :param records: A list of record keys that should be yielded |
1059 | + :param items: A list of items (key,value) that should be yielded. |
1060 | """ |
1061 | store = self.get_chk_bytes() |
1062 | + store._search_key_func = chk_map._search_key_plain |
1063 | iter_nodes = chk_map.iter_interesting_nodes(store, interesting_keys, |
1064 | - uninteresting_keys) |
1065 | - nodes = list(iter_nodes) |
1066 | - for count, (exp, act) in enumerate(izip(expected, nodes)): |
1067 | - exp_record, exp_items = exp |
1068 | - record, items = act |
1069 | - exp_tuple = (exp_record, sorted(exp_items)) |
1070 | - if record is None: |
1071 | - act_tuple = (None, sorted(items)) |
1072 | - else: |
1073 | - act_tuple = (record.key, sorted(items)) |
1074 | - self.assertEqual(exp_tuple, act_tuple, |
1075 | - 'entry %d did not match expected' % count) |
1076 | - self.assertEqual(len(expected), len(nodes)) |
1077 | + old_keys) |
1078 | + record_keys = [] |
1079 | + all_items = [] |
1080 | + for record, new_items in iter_nodes: |
1081 | + if record is not None: |
1082 | + record_keys.append(record.key) |
1083 | + if new_items: |
1084 | + all_items.extend(new_items) |
1085 | + self.assertEqual(sorted(records), sorted(record_keys)) |
1086 | + self.assertEqual(sorted(items), sorted(all_items)) |
1087 | |
1088 | def test_empty_to_one_keys(self): |
1089 | target = self.get_map_key({('a',): 'content'}) |
1090 | - self.assertIterInteresting( |
1091 | - [(target, [(('a',), 'content')]), |
1092 | - ], [target], []) |
1093 | + self.assertIterInteresting([target], |
1094 | + [(('a',), 'content')], |
1095 | + [target], []) |
1096 | |
1097 | def test_none_to_one_key(self): |
1098 | basis = self.get_map_key({}) |
1099 | target = self.get_map_key({('a',): 'content'}) |
1100 | - self.assertIterInteresting( |
1101 | - [(None, [(('a',), 'content')]), |
1102 | - (target, []), |
1103 | - ], [target], [basis]) |
1104 | + self.assertIterInteresting([target], |
1105 | + [(('a',), 'content')], |
1106 | + [target], [basis]) |
1107 | |
1108 | def test_one_to_none_key(self): |
1109 | basis = self.get_map_key({('a',): 'content'}) |
1110 | target = self.get_map_key({}) |
1111 | - self.assertIterInteresting( |
1112 | - [(target, [])], |
1113 | - [target], [basis]) |
1114 | + self.assertIterInteresting([target], |
1115 | + [], |
1116 | + [target], [basis]) |
1117 | |
1118 | def test_common_pages(self): |
1119 | basis = self.get_map_key({('a',): 'content', |
1120 | @@ -1964,10 +2546,9 @@ |
1121 | target_map._dump_tree()) |
1122 | b_key = target_map._root_node._items['b'].key() |
1123 | # This should return the root node, and the node for the 'b' key |
1124 | - self.assertIterInteresting( |
1125 | - [(target, []), |
1126 | - (b_key, [(('b',), 'other content')])], |
1127 | - [target], [basis]) |
1128 | + self.assertIterInteresting([target, b_key], |
1129 | + [(('b',), 'other content')], |
1130 | + [target], [basis]) |
1131 | |
1132 | def test_common_sub_page(self): |
1133 | basis = self.get_map_key({('aaa',): 'common', |
1134 | @@ -1991,12 +2572,11 @@ |
1135 | # The key for the internal aa node |
1136 | a_key = target_map._root_node._items['a'].key() |
1137 | # The key for the leaf aab node |
1138 | + # aaa_key = target_map._root_node._items['a']._items['aaa'].key() |
1139 | aab_key = target_map._root_node._items['a']._items['aab'].key() |
1140 | - self.assertIterInteresting( |
1141 | - [(target, []), |
1142 | - (a_key, []), |
1143 | - (aab_key, [(('aab',), 'new')])], |
1144 | - [target], [basis]) |
1145 | + self.assertIterInteresting([target, a_key, aab_key], |
1146 | + [(('aab',), 'new')], |
1147 | + [target], [basis]) |
1148 | |
1149 | def test_common_leaf(self): |
1150 | basis = self.get_map_key({}) |
1151 | @@ -2040,29 +2620,22 @@ |
1152 | a_key = target3_map._root_node._items['a'].key() |
1153 | aac_key = target3_map._root_node._items['a']._items['aac'].key() |
1154 | self.assertIterInteresting( |
1155 | - [(None, [(('aaa',), 'common')]), |
1156 | - (target1, []), |
1157 | - (target2, []), |
1158 | - (target3, []), |
1159 | - (b_key, [(('bbb',), 'new')]), |
1160 | - (a_key, []), |
1161 | - (aac_key, [(('aac',), 'other')]), |
1162 | - ], [target1, target2, target3], [basis]) |
1163 | - |
1164 | - self.assertIterInteresting( |
1165 | - [(target2, []), |
1166 | - (target3, []), |
1167 | - (b_key, [(('bbb',), 'new')]), |
1168 | - (a_key, []), |
1169 | - (aac_key, [(('aac',), 'other')]), |
1170 | - ], [target2, target3], [target1]) |
1171 | - |
1172 | - # This may be a case that we relax. A root node is a deep child of the |
1173 | - # excluded set. The cost is buffering root nodes until we have |
1174 | - # determined all possible exclusions. (Because a prefix of '', cannot |
1175 | - # be excluded.) |
1176 | - self.assertIterInteresting( |
1177 | - [], [target1], [target3]) |
1178 | + [target1, target2, target3, a_key, aac_key, b_key], |
1179 | + [(('aaa',), 'common'), (('bbb',), 'new'), (('aac',), 'other')], |
1180 | + [target1, target2, target3], [basis]) |
1181 | + |
1182 | + self.assertIterInteresting( |
1183 | + [target2, target3, a_key, aac_key, b_key], |
1184 | + [(('bbb',), 'new'), (('aac',), 'other')], |
1185 | + [target2, target3], [target1]) |
1186 | + |
1187 | + # Technically, target1 could be filtered out, but since it is a root |
1188 | + # node, we yield it immediately, rather than waiting to find out much |
1189 | + # later on. |
1190 | + self.assertIterInteresting( |
1191 | + [target1], |
1192 | + [], |
1193 | + [target1], [target3]) |
1194 | |
1195 | def test_multiple_maps(self): |
1196 | basis1 = self.get_map_key({('aaa',): 'common', |
1197 | @@ -2111,18 +2684,14 @@ |
1198 | # The key for the leaf bba node |
1199 | bba_key = target2_map._root_node._items['b']._items['bba'].key() |
1200 | self.assertIterInteresting( |
1201 | - [(target1, []), |
1202 | - (target2, []), |
1203 | - (a_key, []), |
1204 | - (b_key, []), |
1205 | - (aac_key, [(('aac',), 'target1')]), |
1206 | - (bba_key, [(('bba',), 'target2')]), |
1207 | - ], [target1, target2], [basis1, basis2]) |
1208 | + [target1, target2, a_key, aac_key, b_key, bba_key], |
1209 | + [(('aac',), 'target1'), (('bba',), 'target2')], |
1210 | + [target1, target2], [basis1, basis2]) |
1211 | |
1212 | def test_multiple_maps_overlapping_common_new(self): |
1213 | # Test that when a node found through the interesting_keys iteration |
1214 | - # for *some roots* and also via the uninteresting keys iteration, that |
1215 | - # it is still scanned for uninteresting refs and items, because its |
1216 | + # for *some roots* and also via the old keys iteration, that |
 1217 | + # it is still scanned for old refs and items, because it's |
 1218 | + # not truly new. This requires 2 levels of InternalNodes to expose, |
1219 | # because of the way the bootstrap in _find_children_info works. |
1220 | # This suggests that the code is probably amenable to/benefit from |
1221 | @@ -2188,17 +2757,10 @@ |
1222 | right_map._dump_tree()) |
1223 | # Keys from the right side target - none, the root is enough. |
1224 | # Test behaviour |
1225 | - self.expectFailure("we don't properly filter different depths", |
1226 | - self.assertIterInteresting, |
1227 | - [(left, []), |
1228 | - (right, []), |
1229 | - (l_d_key, [(('ddd',), 'change')]), |
1230 | - ], [left, right], [basis]) |
1231 | self.assertIterInteresting( |
1232 | - [(left, []), |
1233 | - (right, []), |
1234 | - (l_d_key, [(('ddd',), 'change')]), |
1235 | - ], [left, right], [basis]) |
1236 | + [right, left, l_d_key], |
1237 | + [(('ddd',), 'change')], |
1238 | + [left, right], [basis]) |
1239 | |
1240 | def test_multiple_maps_similar(self): |
1241 | # We want to have a depth=2 tree, with multiple entries in each leaf |
1242 | @@ -2259,8 +2821,6 @@ |
1243 | r_a_key = right_map._root_node._items['a'].key() |
1244 | r_c_key = right_map._root_node._items['c'].key() |
1245 | self.assertIterInteresting( |
1246 | - [(left, []), |
1247 | - (right, []), |
1248 | - (l_a_key, [(('abb',), 'changed left')]), |
1249 | - (r_c_key, [(('cbb',), 'changed right')]), |
1250 | - ], [left, right], [basis]) |
1251 | + [right, left, l_a_key, r_c_key], |
1252 | + [(('abb',), 'changed left'), (('cbb',), 'changed right')], |
1253 | + [left, right], [basis]) |
This is a slightly better fix for bug #390563. The main changes are:
1) Rewrite iter_interesting_nodes using a class, rather than maintaining state between functions
2) Split up the processing of root nodes into:
a) Read all uninteresting roots, and find uninteresting chks and items
b) Filter out any interesting root keys that were found in (a)
c) Read all remaining root pages and yield them
d) Find interesting references that are not in (a) (and interesting items)
Note the search key prefixes for these references and items, and queue
them up to be returned later.
e) Filter out uninteresting chk pages based on the search key prefixes in (d)
(d) and (e) are the main differences.
For (d) Robert found an edge case where we could return a non-interesting item (which would, in turn, be a text key, which would then be a failure for fetch) when one side had an immediate leaf node, and the other side had a deep tree. The only real change is to just defer returning interesting *items* until all uninteresting pages have been read.
(e) Before we used the 'intersection of interesting references', whereas now we filter simply based on prefix. Likely to be a small win if there are lots of uninteresting pages (lots of merge parents that are not being transmitted).
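The prefix filtering in (e) can be sketched as follows. Note this is an illustrative sketch, not the actual code in bzrlib/chk_map.py: `filter_old_pages`, `old_pages`, and `new_prefixes` are invented names.

```python
# Hedged sketch of step (e): drop old chk pages whose search-key prefix
# cannot overlap any prefix still queued on the new side. All names here
# are hypothetical, for illustration only.

def filter_old_pages(old_pages, new_prefixes):
    """Keep only the (prefix, chk_key) pairs that could still matter.

    Two prefixes can overlap only when one is a prefix of the other: an
    old page at 'a' must still be walked when the new side queued 'ad',
    but an old page at 'b' cannot contain anything under 'a'.
    """
    def overlaps(old_prefix):
        return any(old_prefix.startswith(p) or p.startswith(old_prefix)
                   for p in new_prefixes)
    return [(prefix, key) for prefix, key in old_pages if overlaps(prefix)]
```

With many merge parents that are not being transmitted, most old pages fall outside every new-side prefix and are dropped without ever being read.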
3) After processing root nodes
a) Process the uninteresting queue until completion
b) Process the interesting queue, filtering out all uninteresting items.
(some minor things here, like doing batch '.difference()' checks rather than .difference on every record processed)
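The batched '.difference()' idea in (3b) amounts to something like the sketch below; the names are stand-ins for illustration, not the real attributes of the class.

```python
# Hedged sketch of phase (3b): flush the new-side queue, filtering out
# everything the old-side walk already discovered, one whole batch at a
# time with set.difference() rather than a per-key membership test.

def flush_new_queue(batches, all_old_chks):
    """Yield chk keys present in the new maps but in none of the old maps."""
    seen = set(all_old_chks)  # everything found while walking the old maps
    for batch in batches:
        wanted = set(batch).difference(seen)  # one set op per batch
        seen.update(wanted)                   # never yield a key twice
        for key in sorted(wanted):
            yield key
```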
I looked at trying to use a heapq to unify things more with 'iter_changes' but the constraints are *very* different.
1) CHKMap.iter_changes() is symmetric, in that we need to return a record for (source, target, extra); for iter_interesting_nodes, we *only* need to return information for the 'target' side.
2) CHKMap.iter_changes() only ever affects 2 records, a single source and a single target. iter_interesting_nodes is looking at a many-to-many comparison.
3) CHKMap.iter_changes() only cares about the actual keys, and not about sending the raw byte records to be transmitted for fetch.
4) iter_interesting_nodes is much more concerned about batching because of its many-to-many nature, and the fact it is used for fetch. We really don't want to be doing a round trip per page.
5) The only gain for doing heapq is if we could figure out the algorithm for filtering out nodes that we thought we would want to read, but really don't. However,
a) I do this at the root level, because it is fairly straightforward. So any further improvements only affect trees that are deeper than 2 (1 root, 1+ internal, 1 leaf), which only happens for trees > 50k items.
b) Reading extra chk pages isn't "incorrect", just "inefficient" (the stacking code ensures all referenced chk pages are present in the stacked-on branch, just like we ensure compression texts.)
c) The best I could come up with would be to split the reads up by prefix. So you would page in the 'a' references, then the 'b' references, etc. However, that forces us to split the requests up even further, rather than larger batches of requests. (more...
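Following up on the reviewer's note that process() could just be an itertools.chain of the two phases, a minimal sketch (with stand-in generators, not the real methods):

```python
import itertools

def _read_all_roots():
    # phase 1 stand-in: yield the new-side root records
    yield 'new-root-record'

def _flush_new_queue():
    # phase 2 stand-in: yield the remaining new pages and items
    yield 'new-leaf-record'

def process():
    # process() is phase 1 followed by phase 2, which itertools.chain
    # expresses directly with no extra buffering or state.
    return itertools.chain(_read_all_roots(), _flush_new_queue())
```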