Merge lp:~mbp/bzr/bigstring into lp:bzr

Proposed by Martin Pool
Status: Rejected
Rejected by: Martin Packman
Proposed branch: lp:~mbp/bzr/bigstring
Merge into: lp:bzr
Diff against target: 236 lines (+162/-4)
5 files modified
bzrlib/bigstring.py (+90/-0)
bzrlib/groupcompress.py (+18/-4)
bzrlib/help_topics/en/debug-flags.txt (+1/-0)
bzrlib/tests/__init__.py (+2/-0)
bzrlib/tests/test_bigstring.py (+51/-0)
To merge this branch: bzr merge lp:~mbp/bzr/bigstring
Reviewer: Martin Packman (community)
Status: Disapprove
Review via email: mp+83732@code.launchpad.net

Description of the change

This adds a BigString class, which behaves like a list of byte strings but is not bounded by memory, because the data spills to a temporary file on disk.

It is in aid of bug 890085 (memory usage), but it does not yet clearly help with memory usage, so I'm not sure I actually want to merge it. I'm proposing it here anyway because it's pretty self-contained.
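
For illustration, here is roughly how the new class is meant to be used, based on the API in the diff below (BigString, append_chunks, get_chunks and len); the output file name is made up for the example:

from bzrlib.bigstring import BigString

bs = BigString()
# Chunks are appended to a temporary file rather than held in memory.
bs.append_chunks(['some ', 'compressed ', 'chunks'])
assert len(bs) == 22  # total bytes appended so far

# Read the data back in bounded-size pieces and stream them out;
# joining them into one big string would defeat the purpose.
out = open('out.pack', 'wb')  # hypothetical destination file
for chunk in bs.get_chunks():
    out.write(chunk)
out.close()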

Revision history for this message
Martin Packman (gz) wrote :

One thing I'm not crazy about here is that in almost all cases we're just going to be writing the data to disk anyway, and we carefully use a compressor that supports incremental encoding, but then we need all the output in one list anyway, all to support a needlessly abstract API that wants to deal in streams but provide a length header. So we hurt the common case by writing to disk twice with this change, to avoid fixing the higher levels.

review: Abstain
Revision history for this message
Martin Pool (mbp) wrote :

On 2 December 2011 22:17, Martin Packman <email address hidden> wrote:
> Review: Abstain
>
> One thing I'm not crazy about here is that in almost all cases we're just going to be writing the data to disk anyway, and we carefully use a compressor that supports incremental encoding, but then we need all the output in one list anyway, all to support a needlessly abstract API that wants to deal in streams but provide a length header. So we hurt the common case by writing to disk twice with this change, to avoid fixing the higher levels.

Yes, obviously there is some friction in the APIs, and this is not a
magic bullet for fixing that.

Beyond the API, we do have a length-prefixed format, unless we can
either think of some clever way to write multiple chunks in a
compatible way, decide to handle large files with a different format
flag, or find some other way around it (is there one?). To write that
format we need to know the length of the compressed data before we
write the header, so we need to buffer up all the compressed data. At
the moment we buffer it in memory; we could instead buffer it to
disk, at least for large blocks. When there is physical memory free,
writing to a temporary file and then erasing it shouldn't be much
slower than just holding it in memory; when physical memory is low it
will probably cope much better.

--
Martin
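
To make that constraint concrete, here is a rough sketch of the pattern being discussed, not the actual groupcompress code: the length header can only be written once the compressor has been flushed, so the compressed output has to be buffered somewhere first, and a temporary file is one possible place. The header layout is simplified to a bare decimal length, and the chunks are Python 2 byte strings, as in bzrlib:

import tempfile
import zlib

def write_length_prefixed_block(chunks, out):
    """Compress chunks, then write a length header followed by the data."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
    buf = tempfile.TemporaryFile()  # spill compressed output to disk
    length = 0
    for chunk in chunks:
        data = compressor.compress(chunk)
        buf.write(data)
        length += len(data)
    data = compressor.flush()
    buf.write(data)
    length += len(data)
    # Only now is the total known, so only now can the header go out.
    out.write('%d\n' % length)
    buf.seek(0)
    while True:
        piece = buf.read(16384)
        if not piece:
            break
        out.write(piece)
    buf.close()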

Revision history for this message
Martin Pool (mbp) wrote :

...I'm not going to merge this unless it clearly does help; I've just
put it up here because it is fairly self-contained.

Revision history for this message
Martin Packman (gz) wrote :

Yes, having the length prefix in the format does complicate things. The best idea I could come up with was to always write the pack to disk first, even when streaming to a remote location. Then it would be possible to leave log10(uncompressed_len + overhead) bytes at the start, then when finalizing the compressobj seek back and write in the compressed length with leading zeros. That may still be a format change.
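
A rough sketch of that reserve-and-backpatch idea (a hypothetical helper, not existing bzrlib code; the fixed field width and the bare decimal header are made up for illustration):

import zlib

def write_backpatched_block(chunks, f, width=20):
    """Reserve a zero-padded length field, stream compressed data, then fix the field.

    This only works on a seekable file, which is why the pack would have to
    be written to local disk first even when the destination is remote.
    """
    header_pos = f.tell()
    f.write('0' * width + '\n')  # placeholder for the compressed length
    compressor = zlib.compressobj()
    length = 0
    for chunk in chunks:
        data = compressor.compress(chunk)
        f.write(data)
        length += len(data)
    data = compressor.flush()
    f.write(data)
    length += len(data)
    end_pos = f.tell()
    f.seek(header_pos)
    f.write('%0*d' % (width, length))  # backpatch with leading zeros
    f.seek(end_pos)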

Revision history for this message
Martin Pool (mbp) wrote :

On 8 December 2011 01:46, Martin Packman <email address hidden> wrote:
> Yes, having the length prefix in the format does complicate things. The best idea I could come up with was to always write the pack to disk first, even when streaming to a remote location. Then it would be possible to leave log10(uncompressed_len + overhead) bytes at the start, then when finalizing the compressobj seek back and write in the compressed length with leading zeros. That may still be a format change.

seeking won't work well on a network stream :)

--
Martin

Revision history for this message
Martin Packman (gz) wrote :

> On 8 December 2011 01:46, Martin Packman <email address hidden> wrote:
> > Yes, having the length prefix in the format does complicate things. The best
> idea I could come up with was to always write the pack to disk first, even
> when streaming to a remote location. Then it would be possible to leave
> log10(uncompressed_len + overhead) bytes at the start, then when finalizing
> the compressobj seek back and write in the compressed length with leading
> zeros. That may still be a format change.
>
> seeking won't work well on a network stream :)

Hence the need to always write to disk first. It means delaying sending any data till the whole compressed block is completed, but that's already the case now as the whole lot is compressed in memory in order to get the size for the header.

Revision history for this message
Vincent Ladeuil (vila) wrote :

An alternative to backpatching is to send chunks with a marker indicating that there are more to come.
The receiving side can then buffer them (memory or disk) until the last chunk is received.
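
That would look something like chunked transfer encoding. A minimal sketch of such framing (purely illustrative; not an existing bzr wire format):

def send_chunked(chunks, write):
    """Send each chunk as a hex-length line followed by its data; '0' ends."""
    for chunk in chunks:
        if chunk:
            write('%x\n' % len(chunk))
            write(chunk)
    write('0\n')

def receive_chunked(readline, read):
    """Yield chunks until the zero-length terminator arrives."""
    while True:
        length = int(readline(), 16)
        if not length:
            return
        yield read(length)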

Revision history for this message
Martin Packman (gz) wrote :

Without follow-up work I don't think it's useful to land this; the code can be referred to later even if it's not merged.

review: Disapprove
Revision history for this message
Martin Pool (mbp) wrote :

I agree (regretfully).

Unmerged revisions

6262. By Martin Pool

Remove stray pdb call

6261. By Martin Pool

Show the bigstring temp file name

6260. By Martin Pool

Use BigString from groupcompress and add some debug help

6259. By Martin Pool

Add basic BigString implementation

Preview Diff

=== added file 'bzrlib/bigstring.py'
--- bzrlib/bigstring.py 1970-01-01 00:00:00 +0000
+++ bzrlib/bigstring.py 2011-11-29 07:33:40 +0000
@@ -0,0 +1,90 @@
+# Copyright (C) 2011 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+
+"""Large bytestring-like object that spills to disk."""
+
+
+import tempfile
+
+
+class BigString(object):
+    """Large bytestring-like object that spills to disk.
+
+    BigStrings can have content appended to them, can report their length,
+    and can be read back as a series of chunks.
+
+    The API is such as to encourage client code to be written in a way
+    that will never hold the whole string in memory.
+
+    Typical usage:
+
+    >>> bs = BigString()
+    >>> bs.append_chunks(' %d' % i for i in range(4))
+    >>> len(bs)
+    8
+    >>> ''.join(bs.get_chunks())
+    ' 0 1 2 3'
+
+    The last of these, joining all the chunks up in to one big in-memory
+    string, defeats the purpose of the whole thing. Don't do that. Write them
+    to a file one at a time.
+    """
+
+    # TODO: Keep it in memory when it's small, and just create a file if it
+    # actually gets big.
+
+    def __init__(self):
+        # TODO: Maybe pass a 0 bufsiz, if we're sure we'll always just do large
+        # writes...
+        #
+        # TODO: Maybe fadvise when possible.
+        self._file = tempfile.NamedTemporaryFile(
+            mode='ab+',
+            prefix='bzrbigstr-')
+        self._length = 0
+        self._read_chunk_size = 16384
+
+    def __repr__(self):
+        return "%s(%r)" % (
+            self.__class__.__name__,
+            self._file.name)
+
+    def append_chunks(self, chunks):
+        for b in chunks:
+            if not isinstance(b, str):
+                raise TypeError(type(b))
+            # File is in append mode, so even if we've seeked back this is safe.
+            self._file.write(b)
+            self._length += len(b)
+
+    def __len__(self):
+        return self._length
+
+    def get_chunks(self):
+        self._file.flush() # To the OS, not necessarily to disk.
+        self._file.seek(0)
+        yielded_bytes = 0
+        while True:
+            c = self._file.read(self._read_chunk_size)
+            if not c:
+                break
+            yield c
+            yielded_bytes += len(c)
+        if yielded_bytes != self._length:
+            raise AssertionError(
+                "%r expected to get back %d bytes, actually returned %d" %
+                (self, self._length, yielded_bytes))

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2011-11-18 05:13:19 +0000
+++ bzrlib/groupcompress.py 2011-11-29 07:33:40 +0000
@@ -17,12 +17,14 @@
 """Core compression logic for compressing streams of related files."""
 
 import time
+import warnings
 import zlib
 
 from bzrlib.lazy_import import lazy_import
 lazy_import(globals(), """
 from bzrlib import (
     annotate,
+    bigstring,
     config,
     debug,
     errors,
@@ -298,11 +300,17 @@
         compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
         # Peak in this point is 1 fulltext, 1 compressed text, + zlib overhead
         # (measured peak is maybe 30MB over the above...)
-        compressed_chunks = map(compressor.compress, chunks)
-        compressed_chunks.append(compressor.flush())
+        if 'groupcompress' in debug.debug_flags:
+            trace.mutter(
+                '_create_z_content_from_chunks: %d chunks total %d bytes' % (
+                len(chunks), sum(map(len, chunks))))
+        bigs = bigstring.BigString()
+        bigs.append_chunks(compressor.compress(c) for c in chunks)
+        bigs.append_chunks([compressor.flush()])
+        compressor = None # Can't delete because of Python quirk?
         # Ignore empty chunks
-        self._z_content_chunks = [c for c in compressed_chunks if c]
-        self._z_content_length = sum(map(len, self._z_content_chunks))
+        self._z_content_chunks = list(bigs.get_chunks())
+        self._z_content_length = len(bigs)
 
     def _create_z_content(self):
         if self._z_content_chunks is not None:
@@ -826,6 +834,12 @@
             if nostore_sha == _null_sha1:
                 raise errors.ExistingContent()
             return _null_sha1, 0, 0, 'fulltext'
+        if 'groupcompress' in debug.debug_flags:
+            trace.mutter("compress %d bytes" % len(bytes))
+            if len(bytes) > 10<<20:
+                warnings.warn("groupcompress given over-large %d byte block"
+                    % len(bytes))
+                import pdb;pdb.set_trace()
         # we assume someone knew what they were doing when they passed it in
         if expected_sha is not None:
             sha1 = expected_sha
=== modified file 'bzrlib/help_topics/en/debug-flags.txt'
--- bzrlib/help_topics/en/debug-flags.txt 2011-10-10 13:51:29 +0000
+++ bzrlib/help_topics/en/debug-flags.txt 2011-11-29 07:33:40 +0000
@@ -15,6 +15,7 @@
 -Dfilters         Emit information for debugging content filtering.
 -Dforceinvdeltas  Force use of inventory deltas during generic streaming fetch.
 -Dgraph           Trace graph traversal.
+-Dgroupcompress   Groupcompress internals.
 -Dhashcache       Log every time a working file is read to determine its hash.
 -Dhooks           Trace hook execution.
 -Dhpss            Trace smart protocol requests and responses.

=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py 2011-11-29 06:50:54 +0000
+++ bzrlib/tests/__init__.py 2011-11-29 07:33:40 +0000
@@ -3928,6 +3928,7 @@
         'bzrlib.tests.test_api',
         'bzrlib.tests.test_atomicfile',
         'bzrlib.tests.test_bad_files',
+        'bzrlib.tests.test_bigstring',
         'bzrlib.tests.test_bisect_multi',
         'bzrlib.tests.test_branch',
         'bzrlib.tests.test_branchbuilder',
@@ -4104,6 +4105,7 @@
         return []
     return [
         'bzrlib',
+        'bzrlib.bigstring',
         'bzrlib.branchbuilder',
         'bzrlib.decorators',
         'bzrlib.inventory',

=== added file 'bzrlib/tests/test_bigstring.py'
--- bzrlib/tests/test_bigstring.py 1970-01-01 00:00:00 +0000
+++ bzrlib/tests/test_bigstring.py 2011-11-29 07:33:40 +0000
@@ -0,0 +1,51 @@
+# Copyright (C) 2011 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+"""Tests for BigString"""
+
+from testtools.matchers import (
+    Matcher,
+    StartsWith,
+    )
+
+from bzrlib.bigstring import BigString
+from bzrlib.tests import TestCase
+
+
+class TestBigString(TestCase):
+
+    def test_repr(self):
+        bs = BigString()
+        self.assertThat(repr(bs),
+            StartsWith('BigString('))
+
+    def test_simple_get_chunks(self):
+        bs = BigString()
+        bs.append_chunks(['hello', 'world'])
+        self.assertEqual(''.join(bs.get_chunks()),
+            'helloworld')
+        self.assertEqual(len(bs), 10)
+
+    def test_get_chunks_limited_size(self):
+        bs = BigString()
+        bs.append_chunks(['hello', 'world', '!'])
+        bs._read_chunk_size = 3
+        self.assertEqual(
+            list(bs.get_chunks()),
+            ['hel', 'low', 'orl', 'd!'])
+        self.assertEqual(
+            len(bs),
+            11)