Merge lp:~mbp/bzr/bigstring into lp:bzr

Proposed by Martin Pool
Status: Rejected
Rejected by: Martin Packman
Proposed branch: lp:~mbp/bzr/bigstring
Merge into: lp:bzr
Diff against target: 236 lines (+162/-4)
5 files modified
bzrlib/bigstring.py (+90/-0)
bzrlib/groupcompress.py (+18/-4)
bzrlib/help_topics/en/debug-flags.txt (+1/-0)
bzrlib/tests/__init__.py (+2/-0)
bzrlib/tests/test_bigstring.py (+51/-0)
To merge this branch: bzr merge lp:~mbp/bzr/bigstring
Reviewer: Martin Packman (community)
Review status: Disapprove
Review via email: mp+83732@code.launchpad.net

Description of the change

This adds a BigString class: essentially a list of byte strings that is not bounded by available memory, because appended data spills to a temporary file.

It is in aid of bug 890085 (memory usage), but it does not yet clearly help with memory usage, so I'm not sure I actually want to merge it. I'm proposing it here anyway because it's pretty self-contained.
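
For reference, a minimal usage sketch based on the BigString docstring and tests in this branch (writing to stdout is just for illustration):

    import sys
    from bzrlib.bigstring import BigString

    bs = BigString()                      # appended chunks spill to a temp file
    bs.append_chunks('chunk-%d ' % i for i in range(4))
    print len(bs)                         # total number of bytes appended
    for chunk in bs.get_chunks():         # read back in bounded-size pieces
        sys.stdout.write(chunk)           # pass chunks along; never ''.join() them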

Revision history for this message
Martin Packman (gz) wrote :

One thing I'm not crazy about here is that in almost all cases we're just going to be writing the data to disk anyway: we carefully use a compressor that supports incremental encoding, but then need all the output in one list anyway, all to support a needlessly abstract API that wants to deal in streams but provide a length header. So we hurt the common case by writing to disk twice with this change, to avoid fixing the higher levels.

review: Abstain
Revision history for this message
Martin Pool (mbp) wrote :

On 2 December 2011 22:17, Martin Packman <email address hidden> wrote:
> Review: Abstain
>
> One thing I'm not crazy about here is that in almost all cases we're just going to be writing the data to disk anyway: we carefully use a compressor that supports incremental encoding, but then need all the output in one list anyway, all to support a needlessly abstract API that wants to deal in streams but provide a length header. So we hurt the common case by writing to disk twice with this change, to avoid fixing the higher levels.

Yes, obviously there is some friction in the APIs, and this is not a
magic bullet for fixing that.

Beyond the API, we are stuck with a length-prefixed format unless we
can think of some clever way to write multiple chunks compatibly, or
we decide to handle large files under a different format flag, or
there is some other way around it (is there?). To write that format
we need to know the length of the compressed data before we write the
header, and therefore we need to buffer up all the compressed data.
At the moment we buffer it in memory; we could instead buffer it to
disk, at least for large blocks. When physical memory is free, writing
to a temporary file and then erasing it shouldn't be much slower than
just holding it in memory; when physical memory is low it will
probably cope much better.
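
To make that concrete, a rough sketch of "buffer to disk instead" (the
write_length_prefixed helper and the plain '%d\n' header are invented
for illustration; the real pack/groupcompress header is different):

    import zlib

    from bzrlib.bigstring import BigString

    def write_length_prefixed(out_file, chunks):
        # Buffer the compressed output in a BigString (i.e. in a temp file)
        # so the total length is known before the header is written, without
        # holding the whole compressed block in memory.
        compressor = zlib.compressobj()
        buffered = BigString()
        buffered.append_chunks(compressor.compress(c) for c in chunks)
        buffered.append_chunks([compressor.flush()])
        out_file.write('%d\n' % len(buffered))   # header needs the final length
        for chunk in buffered.get_chunks():
            out_file.write(chunk)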

--
Martin

Revision history for this message
Martin Pool (mbp) wrote :

...I'm not going to merge this unless it clearly does help; I've just
put it up here because it is fairly self-contained.

Revision history for this message
Martin Packman (gz) wrote :

Yes, having the length prefix in the format does complicate things. The best idea I could come up with was to always write the pack to disk first, even when streaming to a remote location. Then it would be possible to leave log10(uncompressed_len + overhead) bytes of space at the start and, when finalizing the compressobj, seek back and write in the compressed length with leading zeros. That may still be a format change.
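
Roughly, the back-patching idea looks like this (sketch only: the fixed-width field stands in for the log10(uncompressed_len + overhead) sizing above, and the layout is invented):

    import zlib

    def write_backpatched(path, chunks, width=20):
        # Reserve a fixed-width length field, stream the compressed data
        # after it, then seek back and fill the field in with leading zeros.
        # This only works on a seekable local file, hence writing the pack
        # to disk first.
        compressor = zlib.compressobj()
        out = open(path, 'wb')
        out.write('0' * width)                 # placeholder for the length field
        length = 0
        for c in chunks:
            z = compressor.compress(c)
            out.write(z)
            length += len(z)
        z = compressor.flush()
        out.write(z)
        length += len(z)
        out.seek(0)
        out.write('%0*d' % (width, length))    # back-patch with leading zeros
        out.close()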

Revision history for this message
Martin Pool (mbp) wrote :

On 8 December 2011 01:46, Martin Packman <email address hidden> wrote:
> Yes, having the length prefix in the format does complicate things. The best idea I could come up with was to always write the pack to disk first, even when streaming to a remote location. Then it would be possible to leave log10(uncompressed_len + overhead) bytes at the start, then when finalizing the compressobj seek back and write in the compressed length with leading zeros. That may still be a format change.

seeking won't work well on a network stream :)

--
Martin

Revision history for this message
Martin Packman (gz) wrote :

> On 8 December 2011 01:46, Martin Packman <email address hidden> wrote:
> > Yes, having the length prefix in the format does complicate things. The best
> idea I could come up with was to always write the pack to disk first, even
> when streaming to a remote location. Then it would be possible to leave
> log10(uncompressed_len + overhead) bytes at the start, then when finalizing
> the compressobj seek back and write in the compressed length with leading
> zeros. That may still be a format change.
>
> seeking won't work well on a network stream :)

Hence the need to always write to disk first. It means delaying sending any data till the whole compressed block is completed, but that's already the case now as the whole lot is compressed in memory in order to get the size for the header.

Revision history for this message
Vincent Ladeuil (vila) wrote :

An alternative to back-patching is to send chunks with a marker indicating that there are more to come.
The receiving side can then buffer them (in memory or on disk) until the last chunk is received.
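
For illustration, a rough sketch of that framing (the marker bytes and fixed-width lengths below are invented, not an existing bzr wire format):

    def frame_chunks(chunks):
        # Sender side: each chunk carries its own length and a flag saying
        # whether more chunks follow, so no total length header is needed.
        for chunk in chunks:
            yield 'c%010d' % len(chunk) + chunk   # 'c': another chunk follows
        yield 'e%010d' % 0                        # 'e': end of stream

    def read_framed(read):
        # Receiver side: collect the chunks (in memory, or in a BigString /
        # temp file) until the end marker arrives.
        while True:
            kind = read(1)
            length = int(read(10))
            if kind == 'e':
                return
            yield read(length)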

Revision history for this message
Martin Packman (gz) wrote :

Without follow-up work I don't think it's useful to land this; the code can be referred to later even if it's not merged.

review: Disapprove
Revision history for this message
Martin Pool (mbp) wrote :

I agree (regretfully).

Unmerged revisions

6262. By Martin Pool

Remove stray pdb call

6261. By Martin Pool

Show the bigstring temp file name

6260. By Martin Pool

Use BigString from groupcompress and add some debug help

6259. By Martin Pool

Add basic BigString implementation

Preview Diff

=== added file 'bzrlib/bigstring.py'
--- bzrlib/bigstring.py 1970-01-01 00:00:00 +0000
+++ bzrlib/bigstring.py 2011-11-29 07:33:40 +0000
@@ -0,0 +1,90 @@
+# Copyright (C) 2011 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+
+"""Large bytestring-like object that spills to disk."""
+
+
+import tempfile
+
+
+class BigString(object):
+    """Large bytestring-like object that spills to disk.
+
+    BigStrings can have content appended to them, can report their length,
+    and can be read back as a series of chunks.
+
+    The API is such as to encourage client code to be written in a way
+    that will never hold the whole string in memory.
+
+    Typical usage:
+
+    >>> bs = BigString()
+    >>> bs.append_chunks(' %d' % i for i in range(4))
+    >>> len(bs)
+    8
+    >>> ''.join(bs.get_chunks())
+    ' 0 1 2 3'
+
+    The last of these, joining all the chunks up in to one big in-memory
+    string, defeats the purpose of the whole thing.  Don't do that.  Write them
+    to a file one at a time.
+    """
+
+    # TODO: Keep it in memory when it's small, and just create a file if it
+    # actually gets big.
+
+    def __init__(self):
+        # TODO: Maybe pass a 0 bufsiz, if we're sure we'll always just do large
+        # writes...
+        #
+        # TODO: Maybe fadvise when possible.
+        self._file = tempfile.NamedTemporaryFile(
+            mode='ab+',
+            prefix='bzrbigstr-')
+        self._length = 0
+        self._read_chunk_size = 16384
+
+    def __repr__(self):
+        return "%s(%r)" % (
+            self.__class__.__name__,
+            self._file.name)
+
+    def append_chunks(self, chunks):
+        for b in chunks:
+            if not isinstance(b, str):
+                raise TypeError(type(b))
+            # File is in append mode, so even if we've seeked back this is safe.
+            self._file.write(b)
+            self._length += len(b)
+
+    def __len__(self):
+        return self._length
+
+    def get_chunks(self):
+        self._file.flush()  # To the OS, not necessarily to disk.
+        self._file.seek(0)
+        yielded_bytes = 0
+        while True:
+            c = self._file.read(self._read_chunk_size)
+            if not c:
+                break
+            yield c
+            yielded_bytes += len(c)
+        if yielded_bytes != self._length:
+            raise AssertionError(
+                "%r expected to get back %d bytes, actually returned %d" %
+                (self, self._length, yielded_bytes))
=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2011-11-18 05:13:19 +0000
+++ bzrlib/groupcompress.py 2011-11-29 07:33:40 +0000
@@ -17,12 +17,14 @@
 """Core compression logic for compressing streams of related files."""
 
 import time
+import warnings
 import zlib
 
 from bzrlib.lazy_import import lazy_import
 lazy_import(globals(), """
 from bzrlib import (
     annotate,
+    bigstring,
     config,
     debug,
     errors,
@@ -298,11 +300,17 @@
         compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
         # Peak in this point is 1 fulltext, 1 compressed text, + zlib overhead
         # (measured peak is maybe 30MB over the above...)
-        compressed_chunks = map(compressor.compress, chunks)
-        compressed_chunks.append(compressor.flush())
+        if 'groupcompress' in debug.debug_flags:
+            trace.mutter(
+                '_create_z_content_from_chunks: %d chunks total %d bytes' % (
+                len(chunks), sum(map(len, chunks))))
+        bigs = bigstring.BigString()
+        bigs.append_chunks(compressor.compress(c) for c in chunks)
+        bigs.append_chunks([compressor.flush()])
+        compressor = None # Can't delete because of Python quirk?
         # Ignore empty chunks
-        self._z_content_chunks = [c for c in compressed_chunks if c]
-        self._z_content_length = sum(map(len, self._z_content_chunks))
+        self._z_content_chunks = list(bigs.get_chunks())
+        self._z_content_length = len(bigs)
 
     def _create_z_content(self):
         if self._z_content_chunks is not None:
@@ -826,6 +834,12 @@
             if nostore_sha == _null_sha1:
                 raise errors.ExistingContent()
             return _null_sha1, 0, 0, 'fulltext'
+        if 'groupcompress' in debug.debug_flags:
+            trace.mutter("compress %d bytes" % len(bytes))
+            if len(bytes) > 10<<20:
+                warnings.warn("groupcompress given over-large %d byte block"
+                    % len(bytes))
+                import pdb;pdb.set_trace()
         # we assume someone knew what they were doing when they passed it in
         if expected_sha is not None:
             sha1 = expected_sha
=== modified file 'bzrlib/help_topics/en/debug-flags.txt'
--- bzrlib/help_topics/en/debug-flags.txt 2011-10-10 13:51:29 +0000
+++ bzrlib/help_topics/en/debug-flags.txt 2011-11-29 07:33:40 +0000
@@ -15,6 +15,7 @@
 -Dfilters          Emit information for debugging content filtering.
 -Dforceinvdeltas   Force use of inventory deltas during generic streaming fetch.
 -Dgraph            Trace graph traversal.
+-Dgroupcompress    Groupcompress internals.
 -Dhashcache        Log every time a working file is read to determine its hash.
 -Dhooks            Trace hook execution.
 -Dhpss             Trace smart protocol requests and responses.
=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py 2011-11-29 06:50:54 +0000
+++ bzrlib/tests/__init__.py 2011-11-29 07:33:40 +0000
@@ -3928,6 +3928,7 @@
         'bzrlib.tests.test_api',
         'bzrlib.tests.test_atomicfile',
         'bzrlib.tests.test_bad_files',
+        'bzrlib.tests.test_bigstring',
         'bzrlib.tests.test_bisect_multi',
         'bzrlib.tests.test_branch',
         'bzrlib.tests.test_branchbuilder',
@@ -4104,6 +4105,7 @@
         return []
     return [
         'bzrlib',
+        'bzrlib.bigstring',
         'bzrlib.branchbuilder',
         'bzrlib.decorators',
         'bzrlib.inventory',
=== added file 'bzrlib/tests/test_bigstring.py'
--- bzrlib/tests/test_bigstring.py 1970-01-01 00:00:00 +0000
+++ bzrlib/tests/test_bigstring.py 2011-11-29 07:33:40 +0000
@@ -0,0 +1,51 @@
+# Copyright (C) 2011 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+"""Tests for BigString"""
+
+from testtools.matchers import (
+    Matcher,
+    StartsWith,
+    )
+
+from bzrlib.bigstring import BigString
+from bzrlib.tests import TestCase
+
+
+class TestBigString(TestCase):
+
+    def test_repr(self):
+        bs = BigString()
+        self.assertThat(repr(bs),
+            StartsWith('BigString('))
+
+    def test_simple_get_chunks(self):
+        bs = BigString()
+        bs.append_chunks(['hello', 'world'])
+        self.assertEqual(''.join(bs.get_chunks()),
+            'helloworld')
+        self.assertEqual(len(bs), 10)
+
+    def test_get_chunks_limited_size(self):
+        bs = BigString()
+        bs.append_chunks(['hello', 'world', '!'])
+        bs._read_chunk_size = 3
+        self.assertEqual(
+            list(bs.get_chunks()),
+            ['hel', 'low', 'orl', 'd!'])
+        self.assertEqual(
+            len(bs),
+            11)