Merge lp:~jameinel/bzr/1.16-commit-fulltext into lp:~bzr/bzr/trunk-old

Proposed by John A Meinel on 2009-06-02
Status: Superseded
Proposed branch: lp:~jameinel/bzr/1.16-commit-fulltext
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 485 lines
To merge this branch: bzr merge lp:~jameinel/bzr/1.16-commit-fulltext
Reviewer Review Type Date Requested Status
bzr-core 2009-06-02 Pending
Review via email: mp+6988@code.launchpad.net

This proposal has been superseded by a proposal from 2009-06-04.

To post a comment you must log in.
John A Meinel (jameinel) wrote :

This branch adds a new api, VersionedFiles.add_text(). If people really want, I could change it to VF.add_chunks(), but add_text() fits what I needed, and was expedient.

The main effect is to change 'bzr commit' to use file.read() rather than file.readlines(), and then to pass that on to VF.add_text() rather than VF.add_lines().

It also spends a little bit of time to remove some of the bits that were causing us to copy the memory structures a lot. It doesn't completely remove the need for a list of lines during Knit.commit() but it *does* remove the need during --dev6.commit.

To test this, I created a 90MB file, which consists of mostly 20 byte strings with no final newline. I then did:
rm -rf .bzr; bzr init --format=X; bzr add; time bzr commit -Dmemory -m "bigfile"

For --pack-0.92:
pre 469,748 kB, 5.554s
post 360,836 kB, 4.789s

For --development6-rich-root:
pre 589,732 kB, 7.785s
post 348,796 kB, 5.803s

So it is both faster and smaller. Though I still need to explore why --dev6 isn't more memory friendly. It seems to be because of the DeltaIndex structures as part of Groupcompress blocks. It might be worthwhile to optimize for not creating those on the first insert into a new group. (for the 90MB file, it seems to allocate 97MB for the 'loose' index, and then packs that into a 134MB index that has empty slots.)

Robert Collins (lifeless) wrote :

On Tue, 2009-06-02 at 20:50 +0000, John A Meinel wrote:
> John A Meinel has proposed merging lp:~jameinel/bzr/1.16-commit-fulltext into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This branch adds a new api, VersionedFiles.add_text(). If people really want, I could change it to VF.add_chunks(), but add_text() fits what I needed, and was expedient.

Is it at all possible to use insert_record_stream?

I'd really like to shrink the VF surface area, not increase it.

-Rob

John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Tue, 2009-06-02 at 20:50 +0000, John A Meinel wrote:
>> John A Meinel has proposed merging lp:~jameinel/bzr/1.16-commit-fulltext into lp:bzr.
>>
>> Requested reviews:
>> bzr-core (bzr-core)
>>
>> This branch adds a new api, VersionedFiles.add_text(). If people really want, I could change it to VF.add_chunks(), but add_text() fits what I needed, and was expedient.
>
> Is it at all possible to use insert_record_stream?
>
> I'd really like to shrink the VF surface area, not increase it.
>
> -Rob
>

Not trivially.

1) It is an incompatible api change to insert_record_stream

2) It requires setting up a FulltextContextFactory and passing in a
stream of 1 entry just to add a text, which isn't particularly nice.

3) It requires adding lots of parameters like 'nostore_sha', and
'random_id', etc, onto insert_record_stream

4) It requires rewriting the internals of
KnitVersionedFiles.insert_record_stream to *not* thunk back to
self.add_lines(chunks_to_lines(record.get_bytes_as('chunked')))

5) nostore_sha especially doesn't fit with the theology of
insert_record_stream. It is really only applicable to a single text, and
insert_record_stream is really designed around many texts. Wedging new
parameters onto a function where it doesn't really fit doesn't seem
*better*.

6) As for VF surface area, there is at least a default implementation
that simply thunks over to .add_lines() for those that don't strictly
care about memory performance. (And thus works fine for Weaves, etc.)

In theory we could try to layer it so that we had an 'ongoing' stream,
and 'yield' texts to be inserted as we find them. But that really
doesn't fit 'nostore_sha' since that also needs to be passed in, and
needs to raise an exception which breaks the stream.
Also, I thought we *wanted* commit for groupcompress to not have to do
deltas, and if we stream the texts in, we would spend a modest amount of
time getting poor compression between text files. (Note that we were
already spending that time to compute the delta index, but I have a
patch which fixes that.)

I can understand wanting to shrink the api. If you really push on it,
I'm willing to deprecate .add_lines() and write a .add_chunks() that is
meant to replace it. (since you can .add_chunks(lines) and
.add_chunks([text])) However, chunks fits slightly worse for knits,
since the Content code and annotation and deltas needs lines anyway, and
Groupcompress wants fulltexts...

So if you push hard, I'll try to find the time to do it. But this was
*much* easier.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkol5pQACgkQJdeBCYSNAAMreQCgqQEE98oPL7DgmZnJRDZIf4F7
g2gAn3sZ91KqokOlLPwRI0rUwJpy/F8+
=wAGi
-----END PGP SIGNATURE-----

Robert Collins (lifeless) wrote :
Download full text (3.2 KiB)

On Wed, 2009-06-03 at 03:00 +0000, John A Meinel wrote:
>
> Not trivially.
>
> 1) It is an incompatible api change to insert_record_stream

Yes. Doing this before 2.0 would be better than doing it later.

> 2) It requires setting up a FulltextContextFactory and passing in a
> stream of 1 entry just to add a text, which isn't particularly nice.

record_iter_changes would pass a generator into
texts.insert_record_stream.

e.g.
text_details = \
self.repository.texts.insert_record_stream(self._ric_texts, ...)
for details in text_details:
    [...]

> 3) It requires adding lots of parameters like 'nostore_sha', and
> 'random_id', etc, onto insert_record_stream

or onto the factory. I'm not sure offhand where is best.

> 4) It requires rewriting the internals of
> KnitVersionedFiles.insert_record_stream to *not* thunk back to
> self.add_lines(chunks_to_lines(record.get_bytes_as('chunked')))

this is fairly straightforward: move add_lines to call
self.insert_record_stream appropriately:- I did that for GCVF and it
worked well.

> 5) nostore_sha especially doesn't fit with the theology of
> insert_record_stream. It is really only applicable to a single text,
> and
> insert_record_stream is really designed around many texts. Wedging new
> parameters onto a function where it doesn't really fit doesn't seem
> *better*.

Agreed; so perhaps an attribute on the factory.

> 6) As for VF surface area, there is at least a default implementation
> that simply thunks over to .add_lines() for those that don't strictly
> care about memory performance. (And thus works fine for Weaves, etc.)

Well, I want to delete add_lines as it is.

> In theory we could try to layer it so that we had an 'ongoing' stream,
> and 'yield' texts to be inserted as we find them. But that really
> doesn't fit 'nostore_sha' since that also needs to be passed in, and
> needs to raise an exception which breaks the stream.

I'd yield data per record.

> Also, I thought we *wanted* commit for groupcompress to not have to do
> deltas, and if we stream the texts in, we would spend a modest amount
> of
> time getting poor compression between text files. (Note that we were
> already spending that time to compute the delta index, but I have a
> patch which fixes that.)

It would be good to measure actually... first commit after all suffers
hugely because every page in the CHKMap is add_text'd separately.

> I can understand wanting to shrink the api. If you really push on it,
> I'm willing to deprecate .add_lines() and write a .add_chunks() that
> is
> meant to replace it. (since you can .add_chunks(lines) and
> .add_chunks([text])) However, chunks fits slightly worse for knits,
> since the Content code and annotation and deltas needs lines anyway,
> and
> Groupcompress wants fulltexts...
>
> So if you push hard, I'll try to find the time to do it. But this was
> *much* easier.

I think we're at the point of maturity in bzr that it makes sense to
spend a small amount of time saying 'whats the cleanest way to do X',
and then talk about how to get there.

At the moment, expanding VF's API doesn't seem desirable, or the best
way to be tackling the problem. I think there should be precisely one
...

Read more...

John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> I think we're at the point of maturity in bzr that it makes sense to
> spend a small amount of time saying 'whats the cleanest way to do X',
> and then talk about how to get there.
>
> At the moment, expanding VF's API doesn't seem desirable, or the best
> way to be tackling the problem. I think there should be precisely one
> way to add texts to a VF, and that should be as small and fast as we can
> make it.
>
> -Rob
>

We're also blocking on a fairly significant win *today* because of a
potential desire to rewrite a lot of code to make something slightly
cleaner. (Which is something that has been a misfeature of the bzr
project for a *long* time.)

I'm not saying we shouldn't do this, I'm just pointing out the issue.

*For now* I don't feel like rewriting the entire insert_record_stream
stack just to get this in. So I'll leave this pending for now. (More
important is to actually get GC stacking working over bzr+ssh, etc.)

I'm also not sure that getting rid of the "add_this_text_to_the_repo" is
really a net win. Having to write code like:
vf.get_record_stream([one_key], 'unordered',
True).next().get_bytes_as('fulltext')

just to get a single text out is ugly. Not to mention prone to raising
bad exceptions like "AbsentContentFactory has no attribute
.get_bytes_as()", rather than something sane like "NoSuchRevision".
Having to do the same thing during *insert* is just as ugly.

I know you wanted to push people towards multi requests, and I
understand why. I'm not sure that completely removing the convenience
functions is a complete solution, though.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkome6kACgkQJdeBCYSNAAOEggCgozRvOFoof+3NISA7xnFQM+MU
DzMAn1FRwhU8zJDGubT6O6BjCyfCODfE
=csCo
-----END PGP SIGNATURE-----

Robert Collins (lifeless) wrote :

On Wed, 2009-06-03 at 13:36 +0000, John A Meinel wrote:

Meta: I'm really confused vis-a-vis reviews and blocking. All I've done
here is *ask* your opinion on reusing insert_record_stream and provided
answers to some of the technical issues you see with that. I haven't set
a review status of veto or resubmit - and I don't think I've signalled
in anyway that I would. So I don't know why you're feeling blocked.

> > I think we're at the point of maturity in bzr that it makes sense to
> > spend a small amount of time saying 'whats the cleanest way to do X',
> > and then talk about how to get there.
> >
> > At the moment, expanding VF's API doesn't seem desirable, or the best
> > way to be tackling the problem. I think there should be precisely one
> > way to add texts to a VF, and that should be as small and fast as we can
> > make it.
> >
> > -Rob
> >
>
> We're also blocking on a fairly significant win *today* because of a
> potential desire to rewrite a lot of code to make something slightly
> cleaner. (Which is something that has been a misfeature of the bzr
> project for a *long* time.)

I think we often ask the question - and thats important. Sometimes the
answer is 'yes we should fix the deep issue' and sometimes its 'lets do
it with the least possible changes'. Some things do get stuck, and thats
a shame - I've had that happen to concepts I've proposed, and seen it
happen to other peoples ideas.

> I'm not saying we shouldn't do this, I'm just pointing out the issue.
>
> *For now* I don't feel like rewriting the entire insert_record_stream
> stack just to get this in. So I'll leave this pending for now. (More
> important is to actually get GC stacking working over bzr+ssh, etc.)

I think it would be a good idea to make the new method private then,
because of the open question hanging over it.

> I'm also not sure that getting rid of the "add_this_text_to_the_repo" is
> really a net win. Having to write code like:
> vf.get_record_stream([one_key], 'unordered',
> True).next().get_bytes_as('fulltext')
>
> just to get a single text out is ugly. Not to mention prone to raising
> bad exceptions like "AbsentContentFactory has no attribute
> .get_bytes_as()", rather than something sane like "NoSuchRevision".
> Having to do the same thing during *insert* is just as ugly.

And yet, single read/single write methods are terrible for networking,
and commit over the network is something we currently support - but
can't make even vaguely fast until commit no long uses add_text_*. With
respect to exceptions, we actually do want different exceptions at
different places, so I think it has on balance cleaned some stuff up, in
fact.

> I know you wanted to push people towards multi requests, and I
> understand why. I'm not sure that completely removing the convenience
> functions is a complete solution, though.

I'd like us to get to the point where the core code doesn't do network
hostile things. Beyond that - well, I'm ok if plugins and library users
want to shoot themselves in the foot.

-Rob

Ian Clatworthy (ian-clatworthy) wrote :

Robert Collins wrote:
> On Wed, 2009-06-03 at 13:36 +0000, John A Meinel wrote:
>
>
>> We're also blocking on a fairly significant win *today* because of a
>> potential desire to rewrite a lot of code to make something slightly
>> cleaner. (Which is something that has been a misfeature of the bzr
>> project for a *long* time.)
>>

I agree this is a problem that we need to sort out. I occasionally put
and leave useful code in plugins simply because it can take weeks of
effort/debate to get APIs extended in bzrlib. If it only takes a few
hours to write the methods in the first place, it's more productive for
me to just leave the code out of the core and cut-and-paste it when I
need it again.

> I think we often ask the question - and thats important. Sometimes the
> answer is 'yes we should fix the deep issue' and sometimes its 'lets do
> it with the least possible changes'. Some things do get stuck, and thats
> a shame - I've had that happen to concepts I've proposed, and seen it
> happen to other peoples ideas.
>
>

I agree it's really important to ask the questions. That's the whole
point of reviews.

>> *For now* I don't feel like rewriting the entire insert_record_stream
>> stack just to get this in. So I'll leave this pending for now. (More
>> important is to actually get GC stacking working over bzr+ssh, etc.)
>>
>
> I think it would be a good idea to make the new method private then,
> because of the open question hanging over it.
>
>

That sounds like a reasonable compromise. The other way to look at the
problem though is this:

  "Is this new API a step forward with medium-to-long term value?"

> I'd like us to get to the point where the core code doesn't do network
> hostile things. Beyond that - well, I'm ok if plugins and library users
> want to shoot themselves in the foot.
>

Right. But there are genuine use cases for having easy-to-use,
appropriate-locally-only APIs, e.g. import tools. I see no problems with
having such APIs *provided* the docstrings point the reader to more
network-friendly alternatives.

FWIW, if John's proposed API is faster than the current commonly-used
one, then it sounds like a one-or-two line change to fast-import for me
to take advantage of it. I appreciate that you want fast-import moving
towards using CommitBuilder instead of it's own CommitImporter class but
that's a much bigger change (and it's some time away).

Ian C.

Robert Collins (lifeless) wrote :

On Thu, 2009-06-04 at 04:09 +0000, Ian Clatworthy wrote:
> Robert Collins wrote:
> > On Wed, 2009-06-03 at 13:36 +0000, John A Meinel wrote:
> >
> >
> >> We're also blocking on a fairly significant win *today* because of a
> >> potential desire to rewrite a lot of code to make something slightly
> >> cleaner. (Which is something that has been a misfeature of the bzr
> >> project for a *long* time.)
> >>
>
> I agree this is a problem that we need to sort out. I occasionally put
> and leave useful code in plugins simply because it can take weeks of
> effort/debate to get APIs extended in bzrlib. If it only takes a few
> hours to write the methods in the first place, it's more productive for
> me to just leave the code out of the core and cut-and-paste it when I
> need it again.

We don't have a good place for experiments ' in core'. And one possible
answer is that we don't need one - thats what we have plugins for. For
instance, I note that your revno cache got rewritten to be significantly
different as you learnt more about the problem. I think this is healthy,
as long as you don't get blocked.

> >> *For now* I don't feel like rewriting the entire insert_record_stream
> >> stack just to get this in. So I'll leave this pending for now. (More
> >> important is to actually get GC stacking working over bzr+ssh, etc.)
> >>
> >
> > I think it would be a good idea to make the new method private then,
> > because of the open question hanging over it.
> >
> >
>
> That sounds like a reasonable compromise. The other way to look at the
> problem though is this:
>
> "Is this new API a step forward with medium-to-long term value?"

I think thats what the design aspect of the review seeks to answer; but
its often hard to tell.

> > I'd like us to get to the point where the core code doesn't do network
> > hostile things. Beyond that - well, I'm ok if plugins and library users
> > want to shoot themselves in the foot.
> >
>
> Right. But there are genuine use cases for having easy-to-use,
> appropriate-locally-only APIs, e.g. import tools. I see no problems with
> having such APIs *provided* the docstrings point the reader to more
> network-friendly alternatives.

In this particular case I'd like to have them as adapters; such as
versionedfile.add_text(versioned_files, bytes, ....):
    for details in
versioned_files.insert_record_stream([FullTextContentFactory(bytes, ...)])
        return details

or whatever. That would separate them cleanly from the core API, prevent
them varying per implementation (easing testing) and make them not the
default way of working.

> FWIW, if John's proposed API is faster than the current commonly-used
> one, then it sounds like a one-or-two line change to fast-import for me
> to take advantage of it. I appreciate that you want fast-import moving
> towards using CommitBuilder instead of it's own CommitImporter class but
> that's a much bigger change (and it's some time away).

I think it would be fine to use a private method in fast-import:
fast-import is trying for maximum speed, and you are keeping a close eye
on it.

-Rob

4404. By John A Meinel on 2009-06-04

Switch the api from VF.add_text to VF._add_text and trim some extra 'features'.

Commit won't be using parent_texts or left_matching_blocks or check_content.
And for something like fast-import, it will be tuned for GC repositories anyway,
and there we don't need parent_texts anyway.

4405. By John A Meinel on 2009-06-05

Merge bzr.dev 4413, bringing in the no-delta-index code.

4406. By John A Meinel on 2009-06-22

Respond to Andrew's review comments.

4407. By John A Meinel on 2009-06-22

Merge bzr.dev 4467 in prep for updating NEWS

4408. By John A Meinel on 2009-06-22

Add a NEWS entry documenting the relation to bug #109114

Unmerged revisions

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2009-05-29 10:25:37 +0000
+++ bzrlib/groupcompress.py 2009-06-04 20:35:40 +0000
@@ -992,6 +992,26 @@
992 nostore_sha=nostore_sha))[0]992 nostore_sha=nostore_sha))[0]
993 return sha1, length, None993 return sha1, length, None
994994
995 def add_text(self, key, parents, text, parent_texts=None,
996 nostore_sha=None, random_id=False,
997 check_content=True):
998 """See VersionedFiles.add_text()."""
999 self._index._check_write_ok()
1000 self._check_add(key, None, random_id, check_content=False)
1001 if text.__class__ is not str:
1002 raise errors.BzrBadParameterUnicode("text")
1003 if parents is None:
1004 # The caller might pass None if there is no graph data, but kndx
1005 # indexes can't directly store that, so we give them
1006 # an empty tuple instead.
1007 parents = ()
1008 # double handling for now. Make it work until then.
1009 length = len(text)
1010 record = FulltextContentFactory(key, parents, None, text)
1011 sha1 = list(self._insert_record_stream([record], random_id=random_id,
1012 nostore_sha=nostore_sha))[0]
1013 return sha1, length, None
1014
995 def add_fallback_versioned_files(self, a_versioned_files):1015 def add_fallback_versioned_files(self, a_versioned_files):
996 """Add a source of texts for texts not present in this knit.1016 """Add a source of texts for texts not present in this knit.
9971017
@@ -1597,7 +1617,7 @@
1597 if refs:1617 if refs:
1598 for ref in refs:1618 for ref in refs:
1599 if ref:1619 if ref:
1600 raise KnitCorrupt(self,1620 raise errors.KnitCorrupt(self,
1601 "attempt to add node with parents "1621 "attempt to add node with parents "
1602 "in parentless index.")1622 "in parentless index.")
1603 refs = ()1623 refs = ()
16041624
=== modified file 'bzrlib/knit.py'
--- bzrlib/knit.py 2009-05-29 10:25:37 +0000
+++ bzrlib/knit.py 2009-06-04 20:35:40 +0000
@@ -909,18 +909,37 @@
909 # indexes can't directly store that, so we give them909 # indexes can't directly store that, so we give them
910 # an empty tuple instead.910 # an empty tuple instead.
911 parents = ()911 parents = ()
912 line_bytes = ''.join(lines)
912 return self._add(key, lines, parents,913 return self._add(key, lines, parents,
913 parent_texts, left_matching_blocks, nostore_sha, random_id)914 parent_texts, left_matching_blocks, nostore_sha, random_id,
915 line_bytes=line_bytes)
916
917 def add_text(self, key, parents, text, parent_texts=None,
918 nostore_sha=None, random_id=False,
919 check_content=True):
920 """See VersionedFiles.add_text()."""
921 self._index._check_write_ok()
922 self._check_add(key, None, random_id, check_content=False)
923 if text.__class__ is not str:
924 raise errors.BzrBadParameterUnicode("text")
925 if parents is None:
926 # The caller might pass None if there is no graph data, but kndx
927 # indexes can't directly store that, so we give them
928 # an empty tuple instead.
929 parents = ()
930 return self._add(key, None, parents,
931 parent_texts, None, nostore_sha, random_id,
932 line_bytes=text)
914933
915 def _add(self, key, lines, parents, parent_texts,934 def _add(self, key, lines, parents, parent_texts,
916 left_matching_blocks, nostore_sha, random_id):935 left_matching_blocks, nostore_sha, random_id,
936 line_bytes):
917 """Add a set of lines on top of version specified by parents.937 """Add a set of lines on top of version specified by parents.
918938
919 Any versions not present will be converted into ghosts.939 Any versions not present will be converted into ghosts.
920 """940 """
921 # first thing, if the content is something we don't need to store, find941 # first thing, if the content is something we don't need to store, find
922 # that out.942 # that out.
923 line_bytes = ''.join(lines)
924 digest = sha_string(line_bytes)943 digest = sha_string(line_bytes)
925 if nostore_sha == digest:944 if nostore_sha == digest:
926 raise errors.ExistingContent945 raise errors.ExistingContent
@@ -947,13 +966,22 @@
947966
948 text_length = len(line_bytes)967 text_length = len(line_bytes)
949 options = []968 options = []
950 if lines:969 no_eol = False
951 if lines[-1][-1] != '\n':970 # Note: line_bytes is not modified to add a newline, that is tracked
952 # copy the contents of lines.971 # via the no_eol flag. 'lines' *is* modified, because that is the
972 # general values needed by the Content code.
973 if line_bytes and line_bytes[-1] != '\n':
974 options.append('no-eol')
975 no_eol = True
976 # Copy the existing list, or create a new one
977 if lines is None:
978 lines = osutils.split_lines(line_bytes)
979 else:
953 lines = lines[:]980 lines = lines[:]
954 options.append('no-eol')981 # Replace the last line with one that ends in a final newline
955 lines[-1] = lines[-1] + '\n'982 lines[-1] = lines[-1] + '\n'
956 line_bytes += '\n'983 if lines is None:
984 lines = osutils.split_lines(line_bytes)
957985
958 for element in key[:-1]:986 for element in key[:-1]:
959 if type(element) != str:987 if type(element) != str:
@@ -965,7 +993,7 @@
965 # Knit hunks are still last-element only993 # Knit hunks are still last-element only
966 version_id = key[-1]994 version_id = key[-1]
967 content = self._factory.make(lines, version_id)995 content = self._factory.make(lines, version_id)
968 if 'no-eol' in options:996 if no_eol:
969 # Hint to the content object that its text() call should strip the997 # Hint to the content object that its text() call should strip the
970 # EOL.998 # EOL.
971 content._should_strip_eol = True999 content._should_strip_eol = True
@@ -986,8 +1014,11 @@
986 if self._factory.__class__ is KnitPlainFactory:1014 if self._factory.__class__ is KnitPlainFactory:
987 # Use the already joined bytes saving iteration time in1015 # Use the already joined bytes saving iteration time in
988 # _record_to_data.1016 # _record_to_data.
1017 dense_lines = [line_bytes]
1018 if no_eol:
1019 dense_lines.append('\n')
989 size, bytes = self._record_to_data(key, digest,1020 size, bytes = self._record_to_data(key, digest,
990 lines, [line_bytes])1021 lines, dense_lines)
991 else:1022 else:
992 # get mixed annotation + content and feed it into the1023 # get mixed annotation + content and feed it into the
993 # serialiser.1024 # serialiser.
@@ -1920,21 +1951,16 @@
1920 function spends less time resizing the final string.1951 function spends less time resizing the final string.
1921 :return: (len, a StringIO instance with the raw data ready to read.)1952 :return: (len, a StringIO instance with the raw data ready to read.)
1922 """1953 """
1923 # Note: using a string copy here increases memory pressure with e.g.1954 chunks = ["version %s %d %s\n" % (key[-1], len(lines), digest)]
1924 # ISO's, but it is about 3 seconds faster on a 1.2Ghz intel machine1955 chunks.extend(dense_lines or lines)
1925 # when doing the initial commit of a mozilla tree. RBC 200709211956 chunks.append("end %s\n" % key[-1])
1926 bytes = ''.join(chain(1957 for chunk in chunks:
1927 ["version %s %d %s\n" % (key[-1],1958 if type(chunk) != str:
1928 len(lines),1959 raise AssertionError(
1929 digest)],1960 'data must be plain bytes was %s' % type(chunk))
1930 dense_lines or lines,
1931 ["end %s\n" % key[-1]]))
1932 if type(bytes) != str:
1933 raise AssertionError(
1934 'data must be plain bytes was %s' % type(bytes))
1935 if lines and lines[-1][-1] != '\n':1961 if lines and lines[-1][-1] != '\n':
1936 raise ValueError('corrupt lines value %r' % lines)1962 raise ValueError('corrupt lines value %r' % lines)
1937 compressed_bytes = tuned_gzip.bytes_to_gzip(bytes)1963 compressed_bytes = tuned_gzip.chunks_to_gzip(chunks)
1938 return len(compressed_bytes), compressed_bytes1964 return len(compressed_bytes), compressed_bytes
19391965
1940 def _split_header(self, line):1966 def _split_header(self, line):
19411967
=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py 2009-06-03 21:31:43 +0000
+++ bzrlib/repository.py 2009-06-04 20:35:40 +0000
@@ -494,12 +494,12 @@
494 ie.executable = content_summary[2]494 ie.executable = content_summary[2]
495 file_obj, stat_value = tree.get_file_with_stat(ie.file_id, path)495 file_obj, stat_value = tree.get_file_with_stat(ie.file_id, path)
496 try:496 try:
497 lines = file_obj.readlines()497 text = file_obj.read()
498 finally:498 finally:
499 file_obj.close()499 file_obj.close()
500 try:500 try:
501 ie.text_sha1, ie.text_size = self._add_text_to_weave(501 ie.text_sha1, ie.text_size = self._add_text_to_weave(
502 ie.file_id, lines, heads, nostore_sha)502 ie.file_id, text, heads, nostore_sha)
503 # Let the caller know we generated a stat fingerprint.503 # Let the caller know we generated a stat fingerprint.
504 fingerprint = (ie.text_sha1, stat_value)504 fingerprint = (ie.text_sha1, stat_value)
505 except errors.ExistingContent:505 except errors.ExistingContent:
@@ -517,8 +517,7 @@
517 # carry over:517 # carry over:
518 ie.revision = parent_entry.revision518 ie.revision = parent_entry.revision
519 return self._get_delta(ie, basis_inv, path), False, None519 return self._get_delta(ie, basis_inv, path), False, None
520 lines = []520 self._add_text_to_weave(ie.file_id, '', heads, None)
521 self._add_text_to_weave(ie.file_id, lines, heads, None)
522 elif kind == 'symlink':521 elif kind == 'symlink':
523 current_link_target = content_summary[3]522 current_link_target = content_summary[3]
524 if not store:523 if not store:
@@ -532,8 +531,7 @@
532 ie.symlink_target = parent_entry.symlink_target531 ie.symlink_target = parent_entry.symlink_target
533 return self._get_delta(ie, basis_inv, path), False, None532 return self._get_delta(ie, basis_inv, path), False, None
534 ie.symlink_target = current_link_target533 ie.symlink_target = current_link_target
535 lines = []534 self._add_text_to_weave(ie.file_id, '', heads, None)
536 self._add_text_to_weave(ie.file_id, lines, heads, None)
537 elif kind == 'tree-reference':535 elif kind == 'tree-reference':
538 if not store:536 if not store:
539 if content_summary[3] != parent_entry.reference_revision:537 if content_summary[3] != parent_entry.reference_revision:
@@ -544,8 +542,7 @@
544 ie.revision = parent_entry.revision542 ie.revision = parent_entry.revision
545 return self._get_delta(ie, basis_inv, path), False, None543 return self._get_delta(ie, basis_inv, path), False, None
546 ie.reference_revision = content_summary[3]544 ie.reference_revision = content_summary[3]
547 lines = []545 self._add_text_to_weave(ie.file_id, '', heads, None)
548 self._add_text_to_weave(ie.file_id, lines, heads, None)
549 else:546 else:
550 raise NotImplementedError('unknown kind')547 raise NotImplementedError('unknown kind')
551 ie.revision = self._new_revision_id548 ie.revision = self._new_revision_id
@@ -745,7 +742,7 @@
745 entry.executable = True742 entry.executable = True
746 else:743 else:
747 entry.executable = False744 entry.executable = False
748 if (carry_over_possible and 745 if (carry_over_possible and
749 parent_entry.executable == entry.executable):746 parent_entry.executable == entry.executable):
750 # Check the file length, content hash after reading747 # Check the file length, content hash after reading
751 # the file.748 # the file.
@@ -754,12 +751,12 @@
754 nostore_sha = None751 nostore_sha = None
755 file_obj, stat_value = tree.get_file_with_stat(file_id, change[1][1])752 file_obj, stat_value = tree.get_file_with_stat(file_id, change[1][1])
756 try:753 try:
757 lines = file_obj.readlines()754 text = file_obj.read()
758 finally:755 finally:
759 file_obj.close()756 file_obj.close()
760 try:757 try:
761 entry.text_sha1, entry.text_size = self._add_text_to_weave(758 entry.text_sha1, entry.text_size = self._add_text_to_weave(
762 file_id, lines, heads, nostore_sha)759 file_id, text, heads, nostore_sha)
763 yield file_id, change[1][1], (entry.text_sha1, stat_value)760 yield file_id, change[1][1], (entry.text_sha1, stat_value)
764 except errors.ExistingContent:761 except errors.ExistingContent:
765 # No content change against a carry_over parent762 # No content change against a carry_over parent
@@ -774,7 +771,7 @@
774 parent_entry.symlink_target == entry.symlink_target):771 parent_entry.symlink_target == entry.symlink_target):
775 carried_over = True772 carried_over = True
776 else:773 else:
777 self._add_text_to_weave(change[0], [], heads, None)774 self._add_text_to_weave(change[0], '', heads, None)
778 elif kind == 'directory':775 elif kind == 'directory':
779 if carry_over_possible:776 if carry_over_possible:
780 carried_over = True777 carried_over = True
@@ -782,7 +779,7 @@
782 # Nothing to set on the entry.779 # Nothing to set on the entry.
783 # XXX: split into the Root and nonRoot versions.780 # XXX: split into the Root and nonRoot versions.
784 if change[1][1] != '' or self.repository.supports_rich_root():781 if change[1][1] != '' or self.repository.supports_rich_root():
785 self._add_text_to_weave(change[0], [], heads, None)782 self._add_text_to_weave(change[0], '', heads, None)
786 elif kind == 'tree-reference':783 elif kind == 'tree-reference':
787 if not self.repository._format.supports_tree_reference:784 if not self.repository._format.supports_tree_reference:
788 # This isn't quite sane as an error, but we shouldn't785 # This isn't quite sane as an error, but we shouldn't
@@ -797,7 +794,7 @@
797 parent_entry.reference_revision == reference_revision):794 parent_entry.reference_revision == reference_revision):
798 carried_over = True795 carried_over = True
799 else:796 else:
800 self._add_text_to_weave(change[0], [], heads, None)797 self._add_text_to_weave(change[0], '', heads, None)
801 else:798 else:
802 raise AssertionError('unknown kind %r' % kind)799 raise AssertionError('unknown kind %r' % kind)
803 if not carried_over:800 if not carried_over:
@@ -818,15 +815,15 @@
818 self._require_root_change(tree)815 self._require_root_change(tree)
819 self.basis_delta_revision = basis_revision_id816 self.basis_delta_revision = basis_revision_id
820817
821 def _add_text_to_weave(self, file_id, new_lines, parents, nostore_sha):818 def _add_text_to_weave(self, file_id, new_text, parents, nostore_sha):
822 # Note: as we read the content directly from the tree, we know its not819 # Note: as we read the content directly from the tree, we know its not
823 # been turned into unicode or badly split - but a broken tree820 # been turned into unicode or badly split - but a broken tree
824 # implementation could give us bad output from readlines() so this is821 # implementation could give us bad output from readlines() so this is
825 # not a guarantee of safety. What would be better is always checking822 # not a guarantee of safety. What would be better is always checking
826 # the content during test suite execution. RBC 20070912823 # the content during test suite execution. RBC 20070912
827 parent_keys = tuple((file_id, parent) for parent in parents)824 parent_keys = tuple((file_id, parent) for parent in parents)
828 return self.repository.texts.add_lines(825 return self.repository.texts.add_text(
829 (file_id, self._new_revision_id), parent_keys, new_lines,826 (file_id, self._new_revision_id), parent_keys, new_text,
830 nostore_sha=nostore_sha, random_id=self.random_revid,827 nostore_sha=nostore_sha, random_id=self.random_revid,
831 check_content=False)[0:2]828 check_content=False)[0:2]
832829
833830
=== modified file 'bzrlib/tests/test_tuned_gzip.py'
--- bzrlib/tests/test_tuned_gzip.py 2009-03-23 14:59:43 +0000
+++ bzrlib/tests/test_tuned_gzip.py 2009-06-04 20:35:40 +0000
@@ -85,3 +85,28 @@
85 self.assertEqual('', stream.read())85 self.assertEqual('', stream.read())
86 # and it should be new member time in the stream.86 # and it should be new member time in the stream.
87 self.failUnless(myfile._new_member)87 self.failUnless(myfile._new_member)
88
89
90class TestToGzip(TestCase):
91
92 def assertToGzip(self, chunks):
93 bytes = ''.join(chunks)
94 gzfromchunks = tuned_gzip.chunks_to_gzip(chunks)
95 gzfrombytes = tuned_gzip.bytes_to_gzip(bytes)
96 self.assertEqual(gzfrombytes, gzfromchunks)
97 decoded = tuned_gzip.GzipFile(fileobj=StringIO(gzfromchunks)).read()
98 self.assertEqual(bytes, decoded)
99
100 def test_single_chunk(self):
101 self.assertToGzip(['a modest chunk\nwith some various\nbits\n'])
102
103 def test_simple_text(self):
104 self.assertToGzip(['some\n', 'strings\n', 'to\n', 'process\n'])
105
106 def test_large_chunks(self):
107 self.assertToGzip(['a large string\n'*1024])
108 self.assertToGzip(['a large string\n']*1024)
109
110 def test_enormous_chunks(self):
111 self.assertToGzip(['a large string\n'*1024*256])
112 self.assertToGzip(['a large string\n']*1024*256)
88113
=== modified file 'bzrlib/tests/test_versionedfile.py'
--- bzrlib/tests/test_versionedfile.py 2009-05-01 18:09:24 +0000
+++ bzrlib/tests/test_versionedfile.py 2009-06-04 20:35:40 +0000
@@ -1471,6 +1471,58 @@
1471 self.addCleanup(lambda:self.cleanup(files))1471 self.addCleanup(lambda:self.cleanup(files))
1472 return files1472 return files
14731473
1474 def test_add_lines(self):
1475 f = self.get_versionedfiles()
1476 if self.key_length == 1:
1477 key0 = ('r0',)
1478 key1 = ('r1',)
1479 key2 = ('r2',)
1480 keyf = ('foo',)
1481 else:
1482 key0 = ('fid', 'r0')
1483 key1 = ('fid', 'r1')
1484 key2 = ('fid', 'r2')
1485 keyf = ('fid', 'foo')
1486 f.add_lines(key0, [], ['a\n', 'b\n'])
1487 if self.graph:
1488 f.add_lines(key1, [key0], ['b\n', 'c\n'])
1489 else:
1490 f.add_lines(key1, [], ['b\n', 'c\n'])
1491 keys = f.keys()
1492 self.assertTrue(key0 in keys)
1493 self.assertTrue(key1 in keys)
1494 records = []
1495 for record in f.get_record_stream([key0, key1], 'unordered', True):
1496 records.append((record.key, record.get_bytes_as('fulltext')))
1497 records.sort()
1498 self.assertEqual([(key0, 'a\nb\n'), (key1, 'b\nc\n')], records)
1499
1500 def test_add_text(self):
1501 f = self.get_versionedfiles()
1502 if self.key_length == 1:
1503 key0 = ('r0',)
1504 key1 = ('r1',)
1505 key2 = ('r2',)
1506 keyf = ('foo',)
1507 else:
1508 key0 = ('fid', 'r0')
1509 key1 = ('fid', 'r1')
1510 key2 = ('fid', 'r2')
1511 keyf = ('fid', 'foo')
1512 f.add_text(key0, [], 'a\nb\n')
1513 if self.graph:
1514 f.add_text(key1, [key0], 'b\nc\n')
1515 else:
1516 f.add_text(key1, [], 'b\nc\n')
1517 keys = f.keys()
1518 self.assertTrue(key0 in keys)
1519 self.assertTrue(key1 in keys)
1520 records = []
1521 for record in f.get_record_stream([key0, key1], 'unordered', True):
1522 records.append((record.key, record.get_bytes_as('fulltext')))
1523 records.sort()
1524 self.assertEqual([(key0, 'a\nb\n'), (key1, 'b\nc\n')], records)
1525
1474 def test_annotate(self):1526 def test_annotate(self):
1475 files = self.get_versionedfiles()1527 files = self.get_versionedfiles()
1476 self.get_diamond_files(files)1528 self.get_diamond_files(files)
@@ -1520,7 +1572,7 @@
1520 trailing_eol=trailing_eol, nograph=not self.graph,1572 trailing_eol=trailing_eol, nograph=not self.graph,
1521 left_only=left_only, nokeys=nokeys)1573 left_only=left_only, nokeys=nokeys)
15221574
1523 def test_add_lines_nostoresha(self):1575 def _add_content_nostoresha(self, add_lines):
1524 """When nostore_sha is supplied using old content raises."""1576 """When nostore_sha is supplied using old content raises."""
1525 vf = self.get_versionedfiles()1577 vf = self.get_versionedfiles()
1526 empty_text = ('a', [])1578 empty_text = ('a', [])
@@ -1528,7 +1580,12 @@
1528 sample_text_no_nl = ('c', ["foo\n", "bar"])1580 sample_text_no_nl = ('c', ["foo\n", "bar"])
1529 shas = []1581 shas = []
1530 for version, lines in (empty_text, sample_text_nl, sample_text_no_nl):1582 for version, lines in (empty_text, sample_text_nl, sample_text_no_nl):
1531 sha, _, _ = vf.add_lines(self.get_simple_key(version), [], lines)1583 if add_lines:
1584 sha, _, _ = vf.add_lines(self.get_simple_key(version), [],
1585 lines)
1586 else:
1587 sha, _, _ = vf.add_text(self.get_simple_key(version), [],
1588 ''.join(lines))
1532 shas.append(sha)1589 shas.append(sha)
1533 # we now have a copy of all the lines in the vf.1590 # we now have a copy of all the lines in the vf.
1534 for sha, (version, lines) in zip(1591 for sha, (version, lines) in zip(
@@ -1537,10 +1594,19 @@
1537 self.assertRaises(errors.ExistingContent,1594 self.assertRaises(errors.ExistingContent,
1538 vf.add_lines, new_key, [], lines,1595 vf.add_lines, new_key, [], lines,
1539 nostore_sha=sha)1596 nostore_sha=sha)
1597 self.assertRaises(errors.ExistingContent,
1598 vf.add_text, new_key, [], ''.join(lines),
1599 nostore_sha=sha)
1540 # and no new version should have been added.1600 # and no new version should have been added.
1541 record = vf.get_record_stream([new_key], 'unordered', True).next()1601 record = vf.get_record_stream([new_key], 'unordered', True).next()
1542 self.assertEqual('absent', record.storage_kind)1602 self.assertEqual('absent', record.storage_kind)
15431603
1604 def test_add_lines_nostoresha(self):
1605 self._add_content_nostoresha(add_lines=True)
1606
1607 def test_add_text_nostoresha(self):
1608 self._add_content_nostoresha(add_lines=False)
1609
1544 def test_add_lines_return(self):1610 def test_add_lines_return(self):
1545 files = self.get_versionedfiles()1611 files = self.get_versionedfiles()
1546 # save code by using the stock data insertion helper.1612 # save code by using the stock data insertion helper.
15471613
=== modified file 'bzrlib/tuned_gzip.py'
--- bzrlib/tuned_gzip.py 2009-03-23 14:59:43 +0000
+++ bzrlib/tuned_gzip.py 2009-06-04 20:35:40 +0000
@@ -52,6 +52,18 @@
52 width=-zlib.MAX_WBITS, mem=zlib.DEF_MEM_LEVEL,52 width=-zlib.MAX_WBITS, mem=zlib.DEF_MEM_LEVEL,
53 crc32=zlib.crc32):53 crc32=zlib.crc32):
54 """Create a gzip file containing bytes and return its content."""54 """Create a gzip file containing bytes and return its content."""
55 return chunks_to_gzip([bytes])
56
57
58def chunks_to_gzip(chunks, factory=zlib.compressobj,
59 level=zlib.Z_DEFAULT_COMPRESSION, method=zlib.DEFLATED,
60 width=-zlib.MAX_WBITS, mem=zlib.DEF_MEM_LEVEL,
61 crc32=zlib.crc32):
62 """Create a gzip file containing chunks and return its content.
63
64 :param chunks: An iterable of strings. Each string can have arbitrary
65 layout.
66 """
55 result = [67 result = [
56 '\037\213' # self.fileobj.write('\037\213') # magic header68 '\037\213' # self.fileobj.write('\037\213') # magic header
57 '\010' # self.fileobj.write('\010') # compression method69 '\010' # self.fileobj.write('\010') # compression method
@@ -69,11 +81,17 @@
69 # using a compressobj avoids a small header and trailer that the compress()81 # using a compressobj avoids a small header and trailer that the compress()
70 # utility function adds.82 # utility function adds.
71 compress = factory(level, method, width, mem, 0)83 compress = factory(level, method, width, mem, 0)
72 result.append(compress.compress(bytes))84 crc = 0
85 total_len = 0
86 for chunk in chunks:
87 crc = crc32(chunk, crc)
88 total_len += len(chunk)
89 zbytes = compress.compress(chunk)
90 if zbytes:
91 result.append(zbytes)
73 result.append(compress.flush())92 result.append(compress.flush())
74 result.append(struct.pack("<L", LOWU32(crc32(bytes))))
75 # size may exceed 2GB, or even 4GB93 # size may exceed 2GB, or even 4GB
76 result.append(struct.pack("<L", LOWU32(len(bytes))))94 result.append(struct.pack("<LL", LOWU32(crc), LOWU32(total_len)))
77 return ''.join(result)95 return ''.join(result)
7896
7997
8098
=== modified file 'bzrlib/versionedfile.py'
--- bzrlib/versionedfile.py 2009-04-29 17:02:36 +0000
+++ bzrlib/versionedfile.py 2009-06-04 20:35:40 +0000
@@ -829,6 +829,14 @@
829 """829 """
830 raise NotImplementedError(self.add_lines)830 raise NotImplementedError(self.add_lines)
831831
832 def add_text(self, key, parents, text, parent_texts=None,
833 nostore_sha=None, random_id=False, check_content=True):
834 return self.add_lines(key, parents, osutils.split_lines(text),
835 parent_texts=parent_texts,
836 nostore_sha=nostore_sha,
837 random_id=random_id,
838 check_content=check_content)
839
832 def add_mpdiffs(self, records):840 def add_mpdiffs(self, records):
833 """Add mpdiffs to this VersionedFile.841 """Add mpdiffs to this VersionedFile.
834842