Merge lp:~mbp/bzr/bigstring into lp:bzr

Proposed by Martin Pool
Status: Rejected
Rejected by: Martin Packman
Proposed branch: lp:~mbp/bzr/bigstring
Merge into: lp:bzr
Diff against target: 236 lines (+162/-4)
5 files modified
bzrlib/bigstring.py (+90/-0)
bzrlib/groupcompress.py (+18/-4)
bzrlib/help_topics/en/debug-flags.txt (+1/-0)
bzrlib/tests/__init__.py (+2/-0)
bzrlib/tests/test_bigstring.py (+51/-0)
To merge this branch: bzr merge lp:~mbp/bzr/bigstring
Reviewer: Martin Packman (community)
Status: Disapprove
Review via email: mp+83732@code.launchpad.net

Description of the change

This adds a BigString class, which behaves like a list of byte strings but is not bounded by memory, because the data spills to a temporary file on disk.

It is in aid of bug 890085 (memory usage), but it does not yet clearly help with memory usage, so I'm not sure I actually want to merge it. I'm proposing it here anyway because it's pretty self-contained.
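
For illustration, here is roughly how the new class is meant to be used, based on the API in the diff below (BigString, append_chunks, get_chunks and len); the output file name is made up for the example:

from bzrlib.bigstring import BigString

bs = BigString()
# Chunks are appended to a temporary file rather than held in memory.
bs.append_chunks(['some ', 'compressed ', 'chunks'])
assert len(bs) == 22  # total bytes appended so far

# Read the data back in bounded-size pieces and stream them out;
# joining them into one big string would defeat the purpose.
out = open('out.pack', 'wb')  # hypothetical destination file
for chunk in bs.get_chunks():
    out.write(chunk)
out.close()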

Revision history for this message
Martin Packman (gz) wrote :

One thing I'm not crazy about here is that in almost all cases we're just going to be writing the data to disk anyway, and we carefully use a compressor that supports incremental encoding, but then we need all the output in one list anyway, all to support a needlessly abstract API that wants to deal in streams but provide a length header. So we hurt the common case by writing to disk twice with this change, to avoid fixing the higher levels.

review: Abstain
Revision history for this message
Martin Pool (mbp) wrote :

On 2 December 2011 22:17, Martin Packman <email address hidden> wrote:
> Review: Abstain
>
> One thing I'm not crazy about here is that in almost all cases we're just going to be writing the data to disk anyway, and we carefully use a compressor that supports incremental encoding, but then we need all the output in one list anyway, all to support a needlessly abstract API that wants to deal in streams but provide a length header. So we hurt the common case by writing to disk twice with this change, to avoid fixing the higher levels.

Yes, obviously there is some friction in the APIs, and this is not a
magic bullet for fixing that.

Beyond the API, we do have a length-prefixed format, unless we can
either think of some clever way to write multiple chunks in a
compatible way, decide to handle large files with a different format
flag, or find some other way around it (is there one?). To write that
format we need to know the length of the compressed data before we
write the header, so we need to buffer up all the compressed data. At
the moment we buffer it in memory; we could instead buffer it to
disk, at least for large blocks. When there is physical memory free,
writing to a temporary file and then erasing it shouldn't be much
slower than just holding it in memory; when physical memory is low it
will probably cope much better.

--
Martin
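
To make that constraint concrete, here is a rough sketch of the pattern being discussed, not the actual groupcompress code: the length header can only be written once the compressor has been flushed, so the compressed output has to be buffered somewhere first, and a temporary file is one possible place. The header layout is simplified to a bare decimal length, and the chunks are Python 2 byte strings, as in bzrlib:

import tempfile
import zlib

def write_length_prefixed_block(chunks, out):
    """Compress chunks, then write a length header followed by the data."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
    buf = tempfile.TemporaryFile()  # spill compressed output to disk
    length = 0
    for chunk in chunks:
        data = compressor.compress(chunk)
        buf.write(data)
        length += len(data)
    data = compressor.flush()
    buf.write(data)
    length += len(data)
    # Only now is the total known, so only now can the header go out.
    out.write('%d\n' % length)
    buf.seek(0)
    while True:
        piece = buf.read(16384)
        if not piece:
            break
        out.write(piece)
    buf.close()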

Revision history for this message
Martin Pool (mbp) wrote :

...I'm not going to merge this unless it clearly does help; I've just
put it up here because it is fairly self-contained.

Revision history for this message
Martin Packman (gz) wrote :

Yes, having the length prefix in the format does complicate things. The best idea I could come up with was to always write the pack to disk first, even when streaming to a remote location. Then it would be possible to leave log10(uncompressed_len + overhead) bytes at the start, then when finalizing the compressobj seek back and write in the compressed length with leading zeros. That may still be a format change.
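
A rough sketch of that reserve-and-backpatch idea (a hypothetical helper, not existing bzrlib code; the fixed field width and the bare decimal header are made up for illustration):

import zlib

def write_backpatched_block(chunks, f, width=20):
    """Reserve a zero-padded length field, stream compressed data, then fix the field.

    This only works on a seekable file, which is why the pack would have to
    be written to local disk first even when the destination is remote.
    """
    header_pos = f.tell()
    f.write('0' * width + '\n')  # placeholder for the compressed length
    compressor = zlib.compressobj()
    length = 0
    for chunk in chunks:
        data = compressor.compress(chunk)
        f.write(data)
        length += len(data)
    data = compressor.flush()
    f.write(data)
    length += len(data)
    end_pos = f.tell()
    f.seek(header_pos)
    f.write('%0*d' % (width, length))  # backpatch with leading zeros
    f.seek(end_pos)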

Revision history for this message
Martin Pool (mbp) wrote :

On 8 December 2011 01:46, Martin Packman <email address hidden> wrote:
> Yes, having the length prefix in the format does complicate things. The best idea I could come up with was to always write the pack to disk first, even when streaming to a remote location. Then it would be possible to leave log10(uncompressed_len + overhead) bytes at the start, then when finalizing the compressobj seek back and write in the compressed length with leading zeros. That may still be a format change.

seeking won't work well on a network stream :)

--
Martin

Revision history for this message
Martin Packman (gz) wrote :

> On 8 December 2011 01:46, Martin Packman <email address hidden> wrote:
> > Yes, having the length prefix in the format does complicate things. The best
> idea I could come up with was to always write the pack to disk first, even
> when streaming to a remote location. Then it would be possible to leave
> log10(uncompressed_len + overhead) bytes at the start, then when finalizing
> the compressobj seek back and write in the compressed length with leading
> zeros. That may still be a format change.
>
> seeking won't work well on a network stream :)

Hence the need to always write to disk first. It means delaying sending any data till the whole compressed block is completed, but that's already the case now as the whole lot is compressed in memory in order to get the size for the header.

Revision history for this message
Vincent Ladeuil (vila) wrote :

An alternative to backpatching is to send chunks with a marker indicating that there are more to come.
The receiving side can then buffer them (memory or disk) until the last chunk is received.
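
That would look something like chunked transfer encoding. A minimal sketch of such framing (purely illustrative; not an existing bzr wire format):

def send_chunked(chunks, write):
    """Send each chunk as a hex-length line followed by its data; '0' ends."""
    for chunk in chunks:
        if chunk:
            write('%x\n' % len(chunk))
            write(chunk)
    write('0\n')

def receive_chunked(readline, read):
    """Yield chunks until the zero-length terminator arrives."""
    while True:
        length = int(readline(), 16)
        if not length:
            return
        yield read(length)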

Revision history for this message
Martin Packman (gz) wrote :

Without follow-up work I don't think it's useful to land this; the code can be referred to later even if it's not merged.

review: Disapprove
Revision history for this message
Martin Pool (mbp) wrote :

I agree (regretfully).

Unmerged revisions

6262. By Martin Pool

Remove stray pdb call

6261. By Martin Pool

Show the bigstring temp file name

6260. By Martin Pool

Use BigString from groupcompress and add some debug help

6259. By Martin Pool

Add basic BigString implementation

Preview Diff

=== added file 'bzrlib/bigstring.py'
--- bzrlib/bigstring.py 1970-01-01 00:00:00 +0000
+++ bzrlib/bigstring.py 2011-11-29 07:33:40 +0000
@@ -0,0 +1,90 @@
+# Copyright (C) 2011 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+
+"""Large bytestring-like object that spills to disk."""
+
+
+import tempfile
+
+
+class BigString(object):
+    """Large bytestring-like object that spills to disk.
+
+    BigStrings can have content appended to them, can report their length,
+    and can be read back as a series of chunks.
+
+    The API is such as to encourage client code to be written in a way
+    that will never hold the whole string in memory.
+
+    Typical usage:
+
+    >>> bs = BigString()
+    >>> bs.append_chunks(' %d' % i for i in range(4))
+    >>> len(bs)
+    8
+    >>> ''.join(bs.get_chunks())
+    ' 0 1 2 3'
+
+    The last of these, joining all the chunks up in to one big in-memory
+    string, defeats the purpose of the whole thing. Don't do that. Write them
+    to a file one at a time.
+    """
+
+    # TODO: Keep it in memory when it's small, and just create a file if it
+    # actually gets big.
+
+    def __init__(self):
+        # TODO: Maybe pass a 0 bufsiz, if we're sure we'll always just do large
+        # writes...
+        #
+        # TODO: Maybe fadvise when possible.
+        self._file = tempfile.NamedTemporaryFile(
+            mode='ab+',
+            prefix='bzrbigstr-')
+        self._length = 0
+        self._read_chunk_size = 16384
+
+    def __repr__(self):
+        return "%s(%r)" % (
+            self.__class__.__name__,
+            self._file.name)
+
+    def append_chunks(self, chunks):
+        for b in chunks:
+            if not isinstance(b, str):
+                raise TypeError(type(b))
+            # File is in append mode, so even if we've seeked back this is safe.
+            self._file.write(b)
+            self._length += len(b)
+
+    def __len__(self):
+        return self._length
+
+    def get_chunks(self):
+        self._file.flush() # To the OS, not necessarily to disk.
+        self._file.seek(0)
+        yielded_bytes = 0
+        while True:
+            c = self._file.read(self._read_chunk_size)
+            if not c:
+                break
+            yield c
+            yielded_bytes += len(c)
+        if yielded_bytes != self._length:
+            raise AssertionError(
+                "%r expected to get back %d bytes, actually returned %d" %
+                (self, self._length, yielded_bytes))

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2011-11-18 05:13:19 +0000
+++ bzrlib/groupcompress.py 2011-11-29 07:33:40 +0000
@@ -17,12 +17,14 @@
 """Core compression logic for compressing streams of related files."""
 
 import time
+import warnings
 import zlib
 
 from bzrlib.lazy_import import lazy_import
 lazy_import(globals(), """
 from bzrlib import (
     annotate,
+    bigstring,
     config,
     debug,
     errors,
@@ -298,11 +300,17 @@
         compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
         # Peak in this point is 1 fulltext, 1 compressed text, + zlib overhead
         # (measured peak is maybe 30MB over the above...)
-        compressed_chunks = map(compressor.compress, chunks)
-        compressed_chunks.append(compressor.flush())
+        if 'groupcompress' in debug.debug_flags:
+            trace.mutter(
+                '_create_z_content_from_chunks: %d chunks total %d bytes' % (
+                len(chunks), sum(map(len, chunks))))
+        bigs = bigstring.BigString()
+        bigs.append_chunks(compressor.compress(c) for c in chunks)
+        bigs.append_chunks([compressor.flush()])
+        compressor = None # Can't delete because of Python quirk?
         # Ignore empty chunks
-        self._z_content_chunks = [c for c in compressed_chunks if c]
-        self._z_content_length = sum(map(len, self._z_content_chunks))
+        self._z_content_chunks = list(bigs.get_chunks())
+        self._z_content_length = len(bigs)
 
     def _create_z_content(self):
         if self._z_content_chunks is not None:
@@ -826,6 +834,12 @@
             if nostore_sha == _null_sha1:
                 raise errors.ExistingContent()
             return _null_sha1, 0, 0, 'fulltext'
+        if 'groupcompress' in debug.debug_flags:
+            trace.mutter("compress %d bytes" % len(bytes))
+            if len(bytes) > 10<<20:
+                warnings.warn("groupcompress given over-large %d byte block"
+                    % len(bytes))
+                import pdb;pdb.set_trace()
         # we assume someone knew what they were doing when they passed it in
         if expected_sha is not None:
             sha1 = expected_sha
=== modified file 'bzrlib/help_topics/en/debug-flags.txt'
--- bzrlib/help_topics/en/debug-flags.txt 2011-10-10 13:51:29 +0000
+++ bzrlib/help_topics/en/debug-flags.txt 2011-11-29 07:33:40 +0000
@@ -15,6 +15,7 @@
 -Dfilters         Emit information for debugging content filtering.
 -Dforceinvdeltas  Force use of inventory deltas during generic streaming fetch.
 -Dgraph           Trace graph traversal.
+-Dgroupcompress   Groupcompress internals.
 -Dhashcache       Log every time a working file is read to determine its hash.
 -Dhooks           Trace hook execution.
 -Dhpss            Trace smart protocol requests and responses.

=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py 2011-11-29 06:50:54 +0000
+++ bzrlib/tests/__init__.py 2011-11-29 07:33:40 +0000
@@ -3928,6 +3928,7 @@
         'bzrlib.tests.test_api',
         'bzrlib.tests.test_atomicfile',
         'bzrlib.tests.test_bad_files',
+        'bzrlib.tests.test_bigstring',
         'bzrlib.tests.test_bisect_multi',
         'bzrlib.tests.test_branch',
         'bzrlib.tests.test_branchbuilder',
@@ -4104,6 +4105,7 @@
         return []
     return [
         'bzrlib',
+        'bzrlib.bigstring',
         'bzrlib.branchbuilder',
         'bzrlib.decorators',
         'bzrlib.inventory',

=== added file 'bzrlib/tests/test_bigstring.py'
--- bzrlib/tests/test_bigstring.py 1970-01-01 00:00:00 +0000
+++ bzrlib/tests/test_bigstring.py 2011-11-29 07:33:40 +0000
@@ -0,0 +1,51 @@
+# Copyright (C) 2011 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+"""Tests for BigString"""
+
+from testtools.matchers import (
+    Matcher,
+    StartsWith,
+    )
+
+from bzrlib.bigstring import BigString
+from bzrlib.tests import TestCase
+
+
+class TestBigString(TestCase):
+
+    def test_repr(self):
+        bs = BigString()
+        self.assertThat(repr(bs),
+            StartsWith('BigString('))
+
+    def test_simple_get_chunks(self):
+        bs = BigString()
+        bs.append_chunks(['hello', 'world'])
+        self.assertEqual(''.join(bs.get_chunks()),
+            'helloworld')
+        self.assertEqual(len(bs), 10)
+
+    def test_get_chunks_limited_size(self):
+        bs = BigString()
+        bs.append_chunks(['hello', 'world', '!'])
+        bs._read_chunk_size = 3
+        self.assertEqual(
+            list(bs.get_chunks()),
+            ['hel', 'low', 'orl', 'd!'])
+        self.assertEqual(
+            len(bs),
+            11)