Merge lp:~jameinel/bzr/2.1.0b4-convert-kg-heads into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Robert Collins
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.1.0b4-convert-kg-heads
Merge into: lp:bzr
Diff against target: 229 lines (+69/-16)
5 files modified
NEWS (+4/-0)
bzrlib/fetch.py (+16/-5)
bzrlib/graph.py (+13/-0)
bzrlib/repository.py (+18/-11)
bzrlib/tests/test_graph.py (+18/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.1.0b4-convert-kg-heads
Reviewer Review Type Date Requested Status
Robert Collins (community) Approve
Review via email: mp+15394@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

See this graph for the net effect of this patch:
http://launchpadlibrarian.net/36247556/conversion_times.png

In converting mysql from 1.9 => 2a, it cuts the conversion time roughly in half.

The basic observation was that we spend *way* too much time graph walking during the conversion. Part of that was because we didn't pass *any* graph object down, so we didn't get any benefit from caching, etc. For every heads() request we paid the full API layering cost (translating back and forth between strings and key tuples for all entries in the history, etc.)

Anyway, we have a very fast heads implementation, so let's use it. I put in the check:

if ... len(revision_ids) > 100:

so that we don't *always* load the full history for small conversion updates. For anything bigger than that, we probably get a net win from loading the whole ancestry and using the faster heads implementation.
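The id-to-key thunking this relies on (GraphThunkIdsToKeys in the diff below) is simple enough to sketch stand-alone. Here is a minimal version; the ToyKeyGraph and its ancestor walk are illustrative stand-ins for bzrlib's key-based KnownGraph, not bzrlib code:

```python
class GraphThunkIdsToKeys(object):
    """Forward heads() calls about plain revision ids to a key-based graph."""

    def __init__(self, graph):
        self._graph = graph

    def heads(self, ids):
        # Wrap each plain id in a 1-tuple key, ask the key-based graph,
        # then unwrap the answers back to plain ids.
        as_keys = [(i,) for i in ids]
        head_keys = self._graph.heads(as_keys)
        return set(h[0] for h in head_keys)


class ToyKeyGraph(object):
    """Illustrative stand-in for a key-based graph with a heads() method."""

    def __init__(self, parent_map):
        self._parent_map = parent_map

    def _ancestors(self, key):
        seen, stack = set(), [key]
        while stack:
            for parent in self._parent_map.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def heads(self, keys):
        # A candidate is a head unless it is an ancestor of another candidate.
        keys = set(keys)
        ancestors = set()
        for key in keys:
            ancestors |= self._ancestors(key)
        return keys - ancestors


# A is the root; B and C branch from A; D merges B and C.
parents = {('D',): [('B',), ('C',)], ('C',): [('A',)],
           ('B',): [('A',)], ('A',): []}
thunk = GraphThunkIdsToKeys(ToyKeyGraph(parents))
print(sorted(thunk.heads(['D', 'A'])))  # ['D']
print(sorted(thunk.heads(['B', 'C'])))  # ['B', 'C']
```

The callers work purely in string revision ids, so the thunk keeps all the tuple-wrapping in one place instead of paying for it at every layer.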

I wish there were a way to put the image inline, like I used to be able to do with Thunderbird sending requests via Bundle Buggy. (At least people using email clients would see the image right away...)

Revision history for this message
Robert Collins (lifeless) wrote :

I'm not sure that 100 is the right figure; could you gather some timing data
on e.g. emacs and put forward a change to that figure if it's wrong?

This branch can land now, however.

 review: approve
 merge: approve

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> Review: Approve
> I'm not sure that 100 is the right figure; could you gather some timing data
> on e.g. emacs and put forward a change to that figure if it's wrong?
>
> This branch can land now however.
>
> review: approve
> merge: approve
>

So I don't think there is going to be an ideal value that is always correct.

The emacs history is going to be very different from mysql, for example.
(I've got about 70k mysql revs, but the mainline is only 2k-4k long
depending on the branch.)

However, repo.revisions.get_known_graph_ancestry([tip]) takes somewhere
between 0.5 and 1.0s. Conversions I've done run at around 3 revs per
second, and this heads() is about 2x faster, so if we assume:

3 revs/s * 1.0s / 0.5 = 6 revs

then somewhere around 6 revs it becomes faster to just load the whole
revision graph.
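The break-even arithmetic above can be written out directly: the one-time graph-load cost, divided by the per-revision time that the fast heads() saves. As a hedged sketch (the function name is mine, and the figures are the rough measurements quoted in this thread, not careful benchmarks):

```python
def break_even_revs(graph_load_seconds, revs_per_second, savings_fraction):
    """Revisions needed before preloading the whole graph pays for itself.

    Preloading costs graph_load_seconds up front; the fast heads() then
    saves roughly savings_fraction of each revision's conversion time,
    i.e. savings_fraction / revs_per_second seconds per revision.
    """
    return graph_load_seconds * revs_per_second / savings_fraction

# mysql: ~1.0s graph load, ~3 revs/s conversion, heads() ~2x faster
print(break_even_revs(1.0, 3, 0.5))  # 6.0
```

A slower graph load or a faster conversion rate both push the break-even point up, which is why no single threshold is right for every repository.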

As mentioned, there are lots of variables. (I think Emacs heads() is
going to generally be cheaper because it is more linear, though it may
be more expensive because then you have to walk more revs?)

I could drop it to 10 revs if you would prefer. Adding 1.0s is pretty
trivial, and we only do this on 'local' access. (IDS only works on
file=>file, and it is being done on the 'source' side for smart streaming.)

Were you thinking the number should be bigger or smaller?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksT7JQACgkQJdeBCYSNAANbMgCfcqUZg3F6mfUsOGfCpdwjnKx/
0bcAn1QsHYKA4FW2cKvYnw5jnGj6OM6w
=QFMf
-----END PGP SIGNATURE-----

Revision history for this message
Robert Collins (lifeless) wrote :

So, all I knew was that 100 was a WAG :)

If the tradeoff point is 6 revisions on mysql, it would be great to also
have a figure for where it is on emacs; then we could triangulate.
However, it seems like 25 or 50 or something might be better for now
until we have more data.

-Rob

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> So, all I knew was that 100 was a WAG :)
>
> If the tradeoff point is 6 revisions on mysql, it would be great to also
> have a figure for where it is on emacs, then we could triangulate.
> However, it seems like 25 or 50 or something might be better for now
> until we have more data.
>
> -Rob
>

I'm having some difficulty downloading the emacs source. My laptop seems
to hiccup on new network connections periodically, and downloading all of
emacs takes longer than the stretch between hiccups. I do have it
downloading on my server, which is a bit underpowered.

It did show our limitations over HTTP, as I hit 335MB downloaded, but
the repo is only about 206MB total. The download is also *very* choppy,
but that machine is woefully underpowered. (P-III 700MHz.)

Anyway, loading the emacs graph seems to take 3.5s. So if you go with the
same 50% savings at 3 revs/sec you end up with a break-even of about 21 revs.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksWev8ACgkQJdeBCYSNAAMv/wCeNGBEQdhdpOYKy+YcbBZ4E/Rs
MOwAnjyuNYtmU5wJ+Uh3azUscM0inIJV
=oHv6
-----END PGP SIGNATURE-----

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-11-27 10:43:47 +0000
+++ NEWS 2009-11-30 03:50:26 +0000
@@ -53,6 +53,10 @@
 etc. Now only change slashes if there is something being glob expanded.
 (John Arbash Meinel, #485771)

+* Use our faster ``KnownGraph.heads()`` functionality when computing the
+ new rich-root heads. This can cut a conversion time in half (mysql from
+ 13.5h => 6.2h) (John Arbash Meinel, #487632)
+
 Improvements
 ************


=== modified file 'bzrlib/fetch.py'
--- bzrlib/fetch.py 2009-08-07 04:29:36 +0000
+++ bzrlib/fetch.py 2009-11-30 03:50:26 +0000
@@ -28,6 +28,8 @@
 from bzrlib.lazy_import import lazy_import
 lazy_import(globals(), """
 from bzrlib import (
+ graph as _mod_graph,
+ static_tuple,
 tsort,
 versionedfile,
 )
@@ -36,10 +38,10 @@
 from bzrlib import (
 errors,
 symbol_versioning,
+ ui,
 )
 from bzrlib.revision import NULL_REVISION
 from bzrlib.trace import mutter
-import bzrlib.ui


 class RepoFetcher(object):
@@ -96,7 +98,7 @@
 # assert not missing
 self.count_total = 0
 self.file_ids_names = {}
- pb = bzrlib.ui.ui_factory.nested_progress_bar()
+ pb = ui.ui_factory.nested_progress_bar()
 pb.show_pct = pb.show_count = False
 try:
 pb.update("Finding revisions", 0, 2)
@@ -123,7 +125,7 @@
 raise errors.IncompatibleRepositories(
 self.from_repository, self.to_repository,
 "different rich-root support")
- pb = bzrlib.ui.ui_factory.nested_progress_bar()
+ pb = ui.ui_factory.nested_progress_bar()
 try:
 pb.update("Get stream source")
 source = self.from_repository._get_source(
@@ -251,13 +253,22 @@
 # yet, and are unlikely to in non-rich-root environments anyway.
 root_id_order.sort(key=operator.itemgetter(0))
 # Create a record stream containing the roots to create.
- from bzrlib.graph import FrozenHeadsCache
- graph = FrozenHeadsCache(graph)
+ if len(revs) > 100:
+ graph = _get_rich_root_heads_graph(self.source_repo, revs)
 new_roots_stream = _new_root_data_stream(
 root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
 return [('texts', new_roots_stream)]


+def _get_rich_root_heads_graph(source_repo, revision_ids):
+ """Get a Graph object suitable for asking heads() for new rich roots."""
+ st = static_tuple.StaticTuple
+ revision_keys = [st(r_id).intern() for r_id in revision_ids]
+ known_graph = source_repo.revisions.get_known_graph_ancestry(
+ revision_keys)
+ return graph.GraphThunkIdsToKeys(known_graph)
+
+
 def _new_root_data_stream(
 root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):
 """Generate a texts substream of synthesised root entries.

=== modified file 'bzrlib/graph.py'
--- bzrlib/graph.py 2009-09-14 01:48:28 +0000
+++ bzrlib/graph.py 2009-11-30 03:50:26 +0000
@@ -1679,6 +1679,19 @@
 return result


+class GraphThunkIdsToKeys(object):
+ """Forwards calls about 'ids' to be about keys internally."""
+
+ def __init__(self, graph):
+ self._graph = graph
+
+ def heads(self, ids):
+ """See Graph.heads()"""
+ as_keys = [(i,) for i in ids]
+ head_keys = self._graph.heads(as_keys)
+ return set([h[0] for h in head_keys])
+
+
 _counters = [0,0,0,0,0,0,0]
 try:
 from bzrlib._known_graph_pyx import KnownGraph

=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py 2009-11-19 15:06:47 +0000
+++ bzrlib/repository.py 2009-11-30 03:50:26 +0000
@@ -26,6 +26,7 @@
 chk_map,
 debug,
 errors,
+ fetch as _mod_fetch,
 fifo_cache,
 generate_ids,
 gpg,
@@ -2668,8 +2669,8 @@
 for ((revision_id,), parent_keys) in \
 self.revisions.get_parent_map(query_keys).iteritems():
 if parent_keys:
- result[revision_id] = tuple(parent_revid
- for (parent_revid,) in parent_keys)
+ result[revision_id] = tuple([parent_revid
+ for (parent_revid,) in parent_keys])
 else:
 result[revision_id] = (_mod_revision.NULL_REVISION,)
 return result
@@ -3412,8 +3413,7 @@
 provided a default one will be created.
 :return: None.
 """
- from bzrlib.fetch import RepoFetcher
- f = RepoFetcher(to_repository=self.target,
+ f = _mod_fetch.RepoFetcher(to_repository=self.target,
 from_repository=self.source,
 last_revision=revision_id,
 fetch_spec=fetch_spec,
@@ -3819,13 +3819,15 @@
 basis_id, delta, current_revision_id, parents_parents)
 cache[current_revision_id] = parent_tree

- def _fetch_batch(self, revision_ids, basis_id, cache):
+ def _fetch_batch(self, revision_ids, basis_id, cache, a_graph=None):
 """Fetch across a few revisions.

 :param revision_ids: The revisions to copy
 :param basis_id: The revision_id of a tree that must be in cache, used
 as a basis for delta when no other base is available
 :param cache: A cache of RevisionTrees that we can use.
+ :param a_graph: A Graph object to determine the heads() of the
+ rich-root data stream.
 :return: The revision_id of the last converted tree. The RevisionTree
 for it will be in cache
 """
@@ -3895,10 +3897,9 @@
 from_texts = self.source.texts
 to_texts = self.target.texts
 if root_keys_to_create:
- from bzrlib.fetch import _new_root_data_stream
- root_stream = _new_root_data_stream(
+ root_stream = _mod_fetch._new_root_data_stream(
 root_keys_to_create, self._revision_id_to_root_id, parent_map,
- self.source)
+ self.source, graph=a_graph)
 to_texts.insert_record_stream(root_stream)
 to_texts.insert_record_stream(from_texts.get_record_stream(
 text_keys, self.target._format._fetch_order,
@@ -3961,13 +3962,20 @@
 cache[basis_id] = basis_tree
 del basis_tree # We don't want to hang on to it here
 hints = []
+ if self._converting_to_rich_root and len(revision_ids) > 100:
+ a_graph = _mod_fetch._get_rich_root_heads_graph(self.source,
+ revision_ids)
+ else:
+ a_graph = None
+
 for offset in range(0, len(revision_ids), batch_size):
 self.target.start_write_group()
 try:
 pb.update('Transferring revisions', offset,
 len(revision_ids))
 batch = revision_ids[offset:offset+batch_size]
- basis_id = self._fetch_batch(batch, basis_id, cache)
+ basis_id = self._fetch_batch(batch, basis_id, cache,
+ a_graph=a_graph)
 except:
 self.target.abort_write_group()
 raise
@@ -4446,8 +4454,7 @@
 fetching the inventory weave.
 """
 if self._rich_root_upgrade():
- import bzrlib.fetch
- return bzrlib.fetch.Inter1and2Helper(
+ return _mod_fetch.Inter1and2Helper(
 self.from_repository).generate_root_texts(revs)
 else:
 return []

=== modified file 'bzrlib/tests/test_graph.py'
--- bzrlib/tests/test_graph.py 2009-08-04 04:36:34 +0000
+++ bzrlib/tests/test_graph.py 2009-11-30 03:50:26 +0000
@@ -1582,6 +1582,24 @@
 self.assertCollapsed(d, d)


+class TestGraphThunkIdsToKeys(tests.TestCase):
+
+ def test_heads(self):
+ # A
+ # |\
+ # B C
+ # |/
+ # D
+ d = {('D',): [('B',), ('C',)], ('C',):[('A',)],
+ ('B',): [('A',)], ('A',): []}
+ g = _mod_graph.Graph(_mod_graph.DictParentsProvider(d))
+ graph_thunk = _mod_graph.GraphThunkIdsToKeys(g)
+ self.assertEqual(['D'], sorted(graph_thunk.heads(['D', 'A'])))
+ self.assertEqual(['D'], sorted(graph_thunk.heads(['D', 'B'])))
+ self.assertEqual(['D'], sorted(graph_thunk.heads(['D', 'C'])))
+ self.assertEqual(['B', 'C'], sorted(graph_thunk.heads(['B', 'C'])))
+
+
 class TestPendingAncestryResultGetKeys(TestCaseWithMemoryTransport):
 """Tests for bzrlib.graph.PendingAncestryResult."""
230