Merge lp:~jameinel/bzr/1.16-no-first-delta-index into lp:~bzr/bzr/trunk-old

Proposed by John A Meinel on 2009-06-02
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/1.16-no-first-delta-index
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 130 lines
To merge this branch: bzr merge lp:~jameinel/bzr/1.16-no-first-delta-index
Reviewer Review Type Date Requested Status
Andrew Bennetts 2009-06-02 Approve on 2009-06-05
Review via email: mp+6991@code.launchpad.net
To post a comment you must log in.
John A Meinel (jameinel) wrote :

This makes a minor modification to the internal DeltaIndex structure for --dev6 repositories.

Namely, it makes generating the first _index "lazy". The idea is that until we do 'make_delta' or a similar function, we don't actually need the index. This has a fairly big impact on 'bzr commit', especially when combined with:
https://code.edge.launchpad.net/~jameinel/bzr/1.16-commit-fulltext/+merge/6988

With both patches, the peak memory consumption is right around 2x a single file content. I know the last bits that we could poke at to drop it down more, but it requires a bit more complexity.

As an added bonus, with both patches applied the time to commit of a single 90MB file in dev6 is down from 7.3s to 4.2s. I would imagine that the 'initial commit' benchmarks for large projects will also be down significantly.

Andrew Bennetts (spiv) wrote :

I haven't very carefully read the pyrex changes, but the diff is small and the description makes sense to me.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2009-06-04 21:05:44 +0000
3+++ NEWS 2009-06-05 02:35:36 +0000
4@@ -31,6 +31,12 @@
5 bugs with stacking and non default formats.)
6 (John Arbash Meinel, #373455)
7
8+* ``--development6-rich-root`` delays generating a delta index for the
9+ first object inserted into a group. This has a beneficial impact on
10+ ``bzr commit`` since each committed texts goes to its own group. For
11+ committing a 90MB file, it drops peak memory by about 200MB, and speeds
12+ up commit from 7s => 4s. (John Arbash Meinel)
13+
14 * Numerous operations are now faster for huge projects, i.e. those
15 with a large number of files and/or a large number of revisions,
16 particularly when the latest development format is used. These
17
18=== modified file 'bzrlib/_groupcompress_pyx.pyx'
19--- bzrlib/_groupcompress_pyx.pyx 2009-04-09 20:23:07 +0000
20+++ bzrlib/_groupcompress_pyx.pyx 2009-06-05 02:35:36 +0000
21@@ -118,6 +118,9 @@
22 self._index = NULL
23 safe_free(<void **>&self._source_infos)
24
25+ def _has_index(self):
26+ return (self._index != NULL)
27+
28 def add_delta_source(self, delta, unadded_bytes):
29 """Add a new delta to the source texts.
30
31@@ -171,6 +174,9 @@
32 source_location = len(self._sources)
33 if source_location >= self._max_num_sources:
34 self._expand_sources()
35+ if source_location != 0 and self._index == NULL:
36+ # We were lazy about populating the index, create it now
37+ self._populate_first_index()
38 self._sources.append(source)
39 c_source = PyString_AS_STRING(source)
40 c_source_size = PyString_GET_SIZE(source)
41@@ -179,11 +185,24 @@
42 src.size = c_source_size
43
44 src.agg_offset = self._source_offset + unadded_bytes
45- index = create_delta_index(src, self._index)
46 self._source_offset = src.agg_offset + src.size
47- if index != NULL:
48- free_delta_index(self._index)
49- self._index = index
50+ # We delay creating the index on the first insert
51+ if source_location != 0:
52+ index = create_delta_index(src, self._index)
53+ if index != NULL:
54+ free_delta_index(self._index)
55+ self._index = index
56+
57+ cdef _populate_first_index(self):
58+ cdef delta_index *index
59+ if len(self._sources) != 1 or self._index != NULL:
60+ raise AssertionError('_populate_first_index should only be'
61+ ' called when we have a single source and no index yet')
62+
63+ # We know that self._index is already NULL, so whatever
64+ # create_delta_index returns is fine
65+ self._index = create_delta_index(&self._source_infos[0], NULL)
66+ assert self._index != NULL
67
68 cdef _expand_sources(self):
69 raise RuntimeError('if we move self._source_infos, then we need to'
70@@ -201,7 +220,10 @@
71 cdef unsigned long delta_size
72
73 if self._index == NULL:
74- return None
75+ if len(self._sources) == 0:
76+ return None
77+ # We were just lazy about generating the index
78+ self._populate_first_index()
79
80 if not PyString_CheckExact(target_bytes):
81 raise TypeError('target is not a str')
82
83=== modified file 'bzrlib/groupcompress.py'
84--- bzrlib/groupcompress.py 2009-05-29 10:25:37 +0000
85+++ bzrlib/groupcompress.py 2009-06-05 02:35:36 +0000
86@@ -746,6 +746,14 @@
87
88 After calling this, the compressor should no longer be used
89 """
90+ # TODO: this causes us to 'bloat' to 2x the size of content in the
91+ # group. This has an impact for 'commit' of large objects.
92+ # One possibility is to use self._content_chunks, and be lazy and
93+ # only fill out self._content as a full string when we actually
94+ # need it. That would at least drop the peak memory consumption
95+ # for 'commit' down to ~1x the size of the largest file, at a
96+ # cost of increased complexity within this code. 2x is still <<
97+ # 3x the size of the largest file, so we are doing ok.
98 content = ''.join(self.chunks)
99 self.chunks = None
100 self._delta_index = None
101
102=== modified file 'bzrlib/tests/test__groupcompress.py'
103--- bzrlib/tests/test__groupcompress.py 2009-04-29 05:53:21 +0000
104+++ bzrlib/tests/test__groupcompress.py 2009-06-05 02:35:36 +0000
105@@ -272,6 +272,25 @@
106 di = self._gc_module.DeltaIndex('test text\n')
107 self.assertEqual('DeltaIndex(1, 10)', repr(di))
108
109+ def test_first_add_source_doesnt_index_until_make_delta(self):
110+ di = self._gc_module.DeltaIndex()
111+ self.assertFalse(di._has_index())
112+ di.add_source(_text1, 0)
113+ self.assertFalse(di._has_index())
114+ # However, asking to make a delta will trigger the index to be
115+ # generated, and will generate a proper delta
116+ delta = di.make_delta(_text2)
117+ self.assertTrue(di._has_index())
118+ self.assertEqual('N\x90/\x1fdiffer from\nagainst other text\n', delta)
119+
120+ def test_second_add_source_triggers_make_index(self):
121+ di = self._gc_module.DeltaIndex()
122+ self.assertFalse(di._has_index())
123+ di.add_source(_text1, 0)
124+ self.assertFalse(di._has_index())
125+ di.add_source(_text2, 0)
126+ self.assertTrue(di._has_index())
127+
128 def test_make_delta(self):
129 di = self._gc_module.DeltaIndex(_text1)
130 delta = di.make_delta(_text2)