Merge lp:~jameinel/bzr/2.1.0b4-convert-kg-heads into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Robert Collins
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.1.0b4-convert-kg-heads
Merge into: lp:bzr
Diff against target: 229 lines (+69/-16)
5 files modified
NEWS (+4/-0)
bzrlib/fetch.py (+16/-5)
bzrlib/graph.py (+13/-0)
bzrlib/repository.py (+18/-11)
bzrlib/tests/test_graph.py (+18/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.1.0b4-convert-kg-heads
Reviewer Review Type Date Requested Status
Robert Collins (community) Approve
Review via email: mp+15394@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

See this graph for the net effect of this patch:
http://launchpadlibrarian.net/36247556/conversion_times.png

In converting mysql from 1.9 => 2a, it cuts the conversion time roughly in half.

The basic observation was that we spend *way* too much time graph walking during the conversion. Part of that was because we didn't pass *any* graph object down, so we didn't get any benefit from caching, etc. For every heads() request we paid the full API layering cost (translating back and forth between strings and key tuples for all entries in the history, etc.)

Anyway, we have a very fast heads implementation, so let's use it. I put in the check:

if ... len(revision_ids) > 100:

so that we don't *always* load the full history for small conversion updates. For anything bigger than that, we probably get a net win from loading the whole ancestry and using the faster heads implementation.
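The id-to-key thunking this relies on (GraphThunkIdsToKeys in the diff below) is simple enough to sketch stand-alone. Here is a minimal version; the ToyKeyGraph and its ancestor walk are illustrative stand-ins for bzrlib's key-based KnownGraph, not bzrlib code:

```python
class GraphThunkIdsToKeys(object):
    """Forward heads() calls about plain revision ids to a key-based graph."""

    def __init__(self, graph):
        self._graph = graph

    def heads(self, ids):
        # Wrap each plain id in a 1-tuple key, ask the key-based graph,
        # then unwrap the answers back to plain ids.
        as_keys = [(i,) for i in ids]
        head_keys = self._graph.heads(as_keys)
        return set(h[0] for h in head_keys)


class ToyKeyGraph(object):
    """Illustrative stand-in for a key-based graph with a heads() method."""

    def __init__(self, parent_map):
        self._parent_map = parent_map

    def _ancestors(self, key):
        seen, stack = set(), [key]
        while stack:
            for parent in self._parent_map.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def heads(self, keys):
        # A candidate is a head unless it is an ancestor of another candidate.
        keys = set(keys)
        ancestors = set()
        for key in keys:
            ancestors |= self._ancestors(key)
        return keys - ancestors


# A is the root; B and C branch from A; D merges B and C.
parents = {('D',): [('B',), ('C',)], ('C',): [('A',)],
           ('B',): [('A',)], ('A',): []}
thunk = GraphThunkIdsToKeys(ToyKeyGraph(parents))
print(sorted(thunk.heads(['D', 'A'])))  # ['D']
print(sorted(thunk.heads(['B', 'C'])))  # ['B', 'C']
```

The callers work purely in string revision ids, so the thunk keeps all the tuple-wrapping in one place instead of paying for it at every layer.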

I wish there were a way to put the image inline, like I used to be able to do with Thunderbird sending requests via Bundle Buggy. (At least people using email clients would see the image right away...)

Revision history for this message
Robert Collins (lifeless) wrote :

I'm not sure that 100 is the right figure; could you gather some timing data
on e.g. emacs and put forward a change to that figure if it's wrong?

This branch can land now, however.

 review: approve
 merge: approve

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> Review: Approve
> I'm not sure that 100 is the right figure; could you gather some timing data
> on e.g. emacs and put forward a change to that figure if it's wrong?
>
> This branch can land now however.
>
> review: approve
> merge: approve
>

So I don't think there is going to be an ideal value that is always correct.

The emacs history is going to be very different from mysql, for example.
(I've got about 70k mysql revs, but the mainline is only 2k-4k long
depending on the branch.)

However, repo.revisions.get_known_graph_ancestry([tip]) takes somewhere
between 0.5 and 1.0s. Conversions I've done run at around 3 revs per
second, and this heads() is about 2x faster, so if we assume:

3 revs/s * 1.0s / 0.5 = 6 revs

then somewhere around 6 revs it becomes faster to just load the whole
revision graph.
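The break-even arithmetic above can be written out directly: the one-time graph-load cost, divided by the per-revision time that the fast heads() saves. As a hedged sketch (the function name is mine, and the figures are the rough measurements quoted in this thread, not careful benchmarks):

```python
def break_even_revs(graph_load_seconds, revs_per_second, savings_fraction):
    """Revisions needed before preloading the whole graph pays for itself.

    Preloading costs graph_load_seconds up front; the fast heads() then
    saves roughly savings_fraction of each revision's conversion time,
    i.e. savings_fraction / revs_per_second seconds per revision.
    """
    return graph_load_seconds * revs_per_second / savings_fraction

# mysql: ~1.0s graph load, ~3 revs/s conversion, heads() ~2x faster
print(break_even_revs(1.0, 3, 0.5))  # 6.0
```

A slower graph load or a faster conversion rate both push the break-even point up, which is why no single threshold is right for every repository.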

As mentioned, there are lots of variables. (I think Emacs heads() is
going to generally be cheaper because it is more linear, though it may
be more expensive because then you have to walk more revs?)

I could drop it to 10 revs if you would prefer. Adding 1.0s is pretty
trivial, and we only do this on 'local' access. (IDS only works on
file=>file, and it is being done on the 'source' side for smart streaming.)

Were you thinking the number should be bigger or smaller?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksT7JQACgkQJdeBCYSNAANbMgCfcqUZg3F6mfUsOGfCpdwjnKx/
0bcAn1QsHYKA4FW2cKvYnw5jnGj6OM6w
=QFMf
-----END PGP SIGNATURE-----

Revision history for this message
Robert Collins (lifeless) wrote :

So, all I knew was that 100 was a WAG :)

If the tradeoff point is 6 revisions on mysql, it would be great to also
have a figure for where it is on emacs; then we could triangulate.
However, it seems like 25 or 50 or something might be better for now
until we have more data.

-Rob

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> So, all I knew was that 100 was a WAG :)
>
> If the tradeoff point is 6 revisions on mysql, it would be great to also
> have a figure for where it is on emacs, then we could triangulate.
> However, it seems like 25 or 50 or something might be better for now
> until we have more data.
>
> -Rob
>

I'm having some difficulty downloading the emacs source. My laptop seems
to hiccup on new network connections periodically, and downloading all of
emacs takes longer than the stretch between hiccups. I do have it
downloading on my server, which is a bit underpowered.

It did show our limitations over HTTP, as I hit 335MB downloaded, but
the repo is only about 206MB total. The download is also *very* choppy,
but that machine is woefully underpowered. (P-III 700MHz.)

Anyway, loading the emacs graph seems to take 3.5s. So if you go with the
same 50% savings at 3 revs/sec you end up with a break-even of about 21 revs.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksWev8ACgkQJdeBCYSNAAMv/wCeNGBEQdhdpOYKy+YcbBZ4E/Rs
MOwAnjyuNYtmU5wJ+Uh3azUscM0inIJV
=oHv6
-----END PGP SIGNATURE-----

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-11-27 10:43:47 +0000
+++ NEWS 2009-11-30 03:50:26 +0000
@@ -53,6 +53,10 @@
 etc. Now only change slashes if there is something being glob expanded.
 (John Arbash Meinel, #485771)

+* Use our faster ``KnownGraph.heads()`` functionality when computing the
+ new rich-root heads. This can cut a conversion time in half (mysql from
+ 13.5h => 6.2h) (John Arbash Meinel, #487632)
+
 Improvements
 ************


=== modified file 'bzrlib/fetch.py'
--- bzrlib/fetch.py 2009-08-07 04:29:36 +0000
+++ bzrlib/fetch.py 2009-11-30 03:50:26 +0000
@@ -28,6 +28,8 @@
 from bzrlib.lazy_import import lazy_import
 lazy_import(globals(), """
 from bzrlib import (
+ graph as _mod_graph,
+ static_tuple,
 tsort,
 versionedfile,
 )
@@ -36,10 +38,10 @@
 from bzrlib import (
 errors,
 symbol_versioning,
+ ui,
 )
 from bzrlib.revision import NULL_REVISION
 from bzrlib.trace import mutter
-import bzrlib.ui


 class RepoFetcher(object):
@@ -96,7 +98,7 @@
 # assert not missing
 self.count_total = 0
 self.file_ids_names = {}
- pb = bzrlib.ui.ui_factory.nested_progress_bar()
+ pb = ui.ui_factory.nested_progress_bar()
 pb.show_pct = pb.show_count = False
 try:
 pb.update("Finding revisions", 0, 2)
@@ -123,7 +125,7 @@
 raise errors.IncompatibleRepositories(
 self.from_repository, self.to_repository,
 "different rich-root support")
- pb = bzrlib.ui.ui_factory.nested_progress_bar()
+ pb = ui.ui_factory.nested_progress_bar()
 try:
 pb.update("Get stream source")
 source = self.from_repository._get_source(
@@ -251,13 +253,22 @@
 # yet, and are unlikely to in non-rich-root environments anyway.
 root_id_order.sort(key=operator.itemgetter(0))
 # Create a record stream containing the roots to create.
- from bzrlib.graph import FrozenHeadsCache
- graph = FrozenHeadsCache(graph)
+ if len(revs) > 100:
+ graph = _get_rich_root_heads_graph(self.source_repo, revs)
 new_roots_stream = _new_root_data_stream(
 root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
 return [('texts', new_roots_stream)]


+def _get_rich_root_heads_graph(source_repo, revision_ids):
+ """Get a Graph object suitable for asking heads() for new rich roots."""
+ st = static_tuple.StaticTuple
+ revision_keys = [st(r_id).intern() for r_id in revision_ids]
+ known_graph = source_repo.revisions.get_known_graph_ancestry(
+ revision_keys)
+ return graph.GraphThunkIdsToKeys(known_graph)
+
+
 def _new_root_data_stream(
 root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):
 """Generate a texts substream of synthesised root entries.

=== modified file 'bzrlib/graph.py'
--- bzrlib/graph.py 2009-09-14 01:48:28 +0000
+++ bzrlib/graph.py 2009-11-30 03:50:26 +0000
@@ -1679,6 +1679,19 @@
 return result


+class GraphThunkIdsToKeys(object):
+ """Forwards calls about 'ids' to be about keys internally."""
+
+ def __init__(self, graph):
+ self._graph = graph
+
+ def heads(self, ids):
+ """See Graph.heads()"""
+ as_keys = [(i,) for i in ids]
+ head_keys = self._graph.heads(as_keys)
+ return set([h[0] for h in head_keys])
+
+
 _counters = [0,0,0,0,0,0,0]
 try:
 from bzrlib._known_graph_pyx import KnownGraph

=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py 2009-11-19 15:06:47 +0000
+++ bzrlib/repository.py 2009-11-30 03:50:26 +0000
@@ -26,6 +26,7 @@
 chk_map,
 debug,
 errors,
+ fetch as _mod_fetch,
 fifo_cache,
 generate_ids,
 gpg,
@@ -2668,8 +2669,8 @@
 for ((revision_id,), parent_keys) in \
 self.revisions.get_parent_map(query_keys).iteritems():
 if parent_keys:
- result[revision_id] = tuple(parent_revid
- for (parent_revid,) in parent_keys)
+ result[revision_id] = tuple([parent_revid
+ for (parent_revid,) in parent_keys])
 else:
 result[revision_id] = (_mod_revision.NULL_REVISION,)
 return result
@@ -3412,8 +3413,7 @@
 provided a default one will be created.
 :return: None.
 """
- from bzrlib.fetch import RepoFetcher
- f = RepoFetcher(to_repository=self.target,
+ f = _mod_fetch.RepoFetcher(to_repository=self.target,
 from_repository=self.source,
 last_revision=revision_id,
 fetch_spec=fetch_spec,
@@ -3819,13 +3819,15 @@
 basis_id, delta, current_revision_id, parents_parents)
 cache[current_revision_id] = parent_tree

- def _fetch_batch(self, revision_ids, basis_id, cache):
+ def _fetch_batch(self, revision_ids, basis_id, cache, a_graph=None):
 """Fetch across a few revisions.

 :param revision_ids: The revisions to copy
 :param basis_id: The revision_id of a tree that must be in cache, used
 as a basis for delta when no other base is available
 :param cache: A cache of RevisionTrees that we can use.
+ :param a_graph: A Graph object to determine the heads() of the
+ rich-root data stream.
 :return: The revision_id of the last converted tree. The RevisionTree
 for it will be in cache
 """
@@ -3895,10 +3897,9 @@
 from_texts = self.source.texts
 to_texts = self.target.texts
 if root_keys_to_create:
- from bzrlib.fetch import _new_root_data_stream
- root_stream = _new_root_data_stream(
+ root_stream = _mod_fetch._new_root_data_stream(
 root_keys_to_create, self._revision_id_to_root_id, parent_map,
- self.source)
+ self.source, graph=a_graph)
 to_texts.insert_record_stream(root_stream)
 to_texts.insert_record_stream(from_texts.get_record_stream(
 text_keys, self.target._format._fetch_order,
@@ -3961,13 +3962,20 @@
 cache[basis_id] = basis_tree
 del basis_tree # We don't want to hang on to it here
 hints = []
+ if self._converting_to_rich_root and len(revision_ids) > 100:
+ a_graph = _mod_fetch._get_rich_root_heads_graph(self.source,
+ revision_ids)
+ else:
+ a_graph = None
+
 for offset in range(0, len(revision_ids), batch_size):
 self.target.start_write_group()
 try:
 pb.update('Transferring revisions', offset,
 len(revision_ids))
 batch = revision_ids[offset:offset+batch_size]
- basis_id = self._fetch_batch(batch, basis_id, cache)
+ basis_id = self._fetch_batch(batch, basis_id, cache,
+ a_graph=a_graph)
 except:
 self.target.abort_write_group()
 raise
@@ -4446,8 +4454,7 @@
 fetching the inventory weave.
 """
 if self._rich_root_upgrade():
- import bzrlib.fetch
- return bzrlib.fetch.Inter1and2Helper(
+ return _mod_fetch.Inter1and2Helper(
 self.from_repository).generate_root_texts(revs)
 else:
 return []

=== modified file 'bzrlib/tests/test_graph.py'
--- bzrlib/tests/test_graph.py 2009-08-04 04:36:34 +0000
+++ bzrlib/tests/test_graph.py 2009-11-30 03:50:26 +0000
@@ -1582,6 +1582,24 @@
 self.assertCollapsed(d, d)


+class TestGraphThunkIdsToKeys(tests.TestCase):
+
+ def test_heads(self):
+ # A
+ # |\
+ # B C
+ # |/
+ # D
+ d = {('D',): [('B',), ('C',)], ('C',):[('A',)],
+ ('B',): [('A',)], ('A',): []}
+ g = _mod_graph.Graph(_mod_graph.DictParentsProvider(d))
+ graph_thunk = _mod_graph.GraphThunkIdsToKeys(g)
+ self.assertEqual(['D'], sorted(graph_thunk.heads(['D', 'A'])))
+ self.assertEqual(['D'], sorted(graph_thunk.heads(['D', 'B'])))
+ self.assertEqual(['D'], sorted(graph_thunk.heads(['D', 'C'])))
+ self.assertEqual(['B', 'C'], sorted(graph_thunk.heads(['B', 'C'])))
+
+
 class TestPendingAncestryResultGetKeys(TestCaseWithMemoryTransport):
 """Tests for bzrlib.graph.PendingAncestryResult."""
230