Merge lp:~jameinel/bzr/1.16-commit-fulltext into lp:~bzr/bzr/trunk-old

Status: | Merged
---|---
Merged at revision: | not available
Proposed branch: | lp:~jameinel/bzr/1.16-commit-fulltext
Merge into: | lp:~bzr/bzr/trunk-old
Diff against target: | 507 lines
To merge this branch: | bzr merge lp:~jameinel/bzr/1.16-commit-fulltext
Related bugs: |

Reviewer | Review Type | Date Requested | Status
---|---|---|---
Andrew Bennetts | | | Approve

Review via email: mp+7080@code.launchpad.net
This proposal supersedes a proposal from 2009-06-02.
Commit message
Description of the change
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal
On Tue, 2009-06-02 at 20:50 +0000, John A Meinel wrote:
> John A Meinel has proposed merging lp:~jameinel/bzr/1.16-commit-fulltext into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This branch adds a new api, VersionedFiles.add_text().
Is it at all possible to use insert_record_stream for this instead?
I'd really like to shrink the VF surface area, not increase it.
-Rob
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal
Robert Collins wrote:
> On Tue, 2009-06-02 at 20:50 +0000, John A Meinel wrote:
>> John A Meinel has proposed merging lp:~jameinel/bzr/1.16-commit-fulltext into lp:bzr.
>>
>> Requested reviews:
>> bzr-core (bzr-core)
>>
>> This branch adds a new api, VersionedFiles.add_text().
>
> Is it at all possible to use insert_record_stream for this instead?
>
> I'd really like to shrink the VF surface area, not increase it.
>
> -Rob
>
Not trivially.
1) It is an incompatible api change to insert_record_stream.
2) It requires setting up a FulltextContentFactory and wrapping it in a
stream of 1 entry just to add a text, which isn't particularly nice.
3) It requires adding lots of parameters like 'nostore_sha', and
'random_id', etc, onto insert_record_stream.
4) It requires rewriting the internals of
KnitVersionedFiles.insert_record_stream, which today just calls
self.add_lines() for each record.
5) nostore_sha especially doesn't fit with the theology of
insert_record_stream, which is a many-at-once api; tacking single-text
parameters onto a function where it doesn't really fit doesn't seem
*better*.
6) As for VF surface area, there is at least a default implementation
that simply thunks over to .add_lines() for those that don't strictly
care about memory performance. (And thus works fine for Weaves, etc.)
In theory we could try to layer it so that we had an 'ongoing' stream,
and 'yield' texts to be inserted as we find them. But that really
doesn't fit 'nostore_sha' since that also needs to be passed in, and
needs to raise an exception which breaks the stream.
Also, I thought we *wanted* commit for groupcompress to not have to do
deltas, and if we stream the texts in, we would spend a modest amount of
time getting poor compression between text files. (Note that we were
already spending that time to compute the delta index, but I have a
patch which fixes that.)
I can understand wanting to shrink the api. If you really push on it,
I'm willing to deprecate .add_lines() and write a .add_chunks() that is
meant to replace it. (since you can .add_chunks(lines) and
.add_chunks([text])). It isn't obviously a win though,
since the Content code and annotation and deltas needs lines anyway, and
Groupcompress wants fulltexts...
So if you push hard, I'll try to find the time to do it. But this was
*much* easier.
John
=:->
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal
On Wed, 2009-06-03 at 03:00 +0000, John A Meinel wrote:
>
> Not trivially.
>
> 1) It is an incompatible api change to insert_record_stream.
Yes. Doing this before 2.0 would be better than doing it later.
> 2) It requires setting up a FulltextContentFactory and wrapping it in a
> stream of 1 entry just to add a text, which isn't particularly nice.
record_iter_changes would pass a generator into
texts.insert_record_stream.
e.g.
text_details = \
    self.repository.texts.insert_record_stream(...)
for details in text_details:
    [...]
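The generator-into-stream shape Robert sketches can be spelled out as a small self-contained example. The names here (Record, Store, insert_record_stream) are illustrative stand-ins, not the actual bzrlib classes:

```python
class Record(object):
    # minimal stand-in for a fulltext record factory
    def __init__(self, key, parents, text):
        self.key = key
        self.parents = parents
        self.text = text


class Store(object):
    def __init__(self):
        self.texts = {}

    def insert_record_stream(self, stream):
        # One entry point serves bulk and single inserts alike: consume
        # the stream lazily, store each text, and yield per-record
        # details back to the caller as they are written.
        for record in stream:
            self.texts[record.key] = record.text
            yield (record.key, len(record.text))


def iter_changed_texts():
    # commit would yield one record per changed file
    yield Record(('file-a',), (), 'contents of a\n')
    yield Record(('file-b',), (), 'contents of b\n')


store = Store()
text_details = store.insert_record_stream(iter_changed_texts())
for details in text_details:
    print(details)
```

The point of the pattern is that no list of texts is ever materialized: the generator produces records one at a time and the store reports back per record.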
> 3) It requires adding lots of parameters like 'nostore_sha', and
> 'random_id', etc, onto insert_record_stream.
or onto the factory. I'm not sure offhand where is best.
> 4) It requires rewriting the internals of
> KnitVersionedFiles.insert_record_stream, which today just calls
> self.add_lines() for each record.
this is fairly straightforward: move add_lines to call
self.insert_record_stream; that sort of inversion has generally
worked well.
> 5) nostore_sha especially doesn't fit with the theology of
> insert_record_stream, which is a many-at-once api; tacking single-text
> parameters onto a function where it doesn't really fit doesn't seem
> *better*.
Agreed; so perhaps an attribute on the factory.
> 6) As for VF surface area, there is at least a default implementation
> that simply thunks over to .add_lines() for those that don't strictly
> care about memory performance. (And thus works fine for Weaves, etc.)
Well, I want to delete add_lines as it is.
> In theory we could try to layer it so that we had an 'ongoing' stream,
> and 'yield' texts to be inserted as we find them. But that really
> doesn't fit 'nostore_sha' since that also needs to be passed in, and
> needs to raise an exception which breaks the stream.
I'd yield data per record.
> Also, I thought we *wanted* commit for groupcompress to not have to do
> deltas, and if we stream the texts in, we would spend a modest amount
> of
> time getting poor compression between text files. (Note that we were
> already spending that time to compute the delta index, but I have a
> patch which fixes that.)
It would be good to measure actually... first commit after all suffers
hugely because every page in the CHKMap is add_text'd separately.
> I can understand wanting to shrink the api. If you really push on it,
> I'm willing to deprecate .add_lines() and write a .add_chunks() that
> is
> meant to replace it. (since you can .add_chunks(lines) and
> .add_chunks([text])). It isn't obviously a win though,
> since the Content code and annotation and deltas needs lines anyway,
> and
> Groupcompress wants fulltexts...
>
> So if you push hard, I'll try to find the time to do it. But this was
> *much* easier.
I think we're at the point of maturity in bzr that it makes sense to
spend a small amount of time saying 'whats the cleanest way to do X',
and then talk about how to get there.
At the moment, expanding VF's API doesn't seem desirable, or the best
way to be tackling the problem. I think there should be precisely one
...
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal
...
> I think we're at the point of maturity in bzr that it makes sense to
> spend a small amount of time saying 'whats the cleanest way to do X',
> and then talk about how to get there.
>
> At the moment, expanding VF's API doesn't seem desirable, or the best
> way to be tackling the problem. I think there should be precisely one
> way to add texts to a VF, and that should be as small and fast as we can
> make it.
>
> -Rob
>
We're also blocking on a fairly significant win *today* because of a
potential desire to rewrite a lot of code to make something slightly
cleaner. (Which is something that has been a misfeature of the bzr
project for a *long* time.)
I'm not saying we shouldn't do this, I'm just pointing out the issue.
*For now* I don't feel like rewriting the entire insert_record_stream
stack just to get this in. So I'll leave this pending for now. (More
important is to actually get GC stacking working over bzr+ssh, etc.)
I'm also not sure that getting rid of the "add_this_text" api is
really a net win. Having to write code like:
vf.get_record_stream([key], 'unordered', True).next().get_bytes_as('fulltext')
just to get a single text out is ugly. Not to mention prone to raising
bad exceptions like "AbsentContentFactory has no attribute
'get_bytes_as'", rather than something sane like "NoSuchRevision".
Having to do the same thing during *insert* is just as ugly.
I know you wanted to push people towards multi requests, and I
understand why. I'm not sure that completely removing the convenience
functions is a complete solution, though.
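The convenience-wrapper idea John defends can be sketched as a thin layer over a many-at-once stream API that converts "record absent" into a sane exception at one place. Everything below is a hypothetical stand-in, not bzrlib's actual API:

```python
class NoSuchText(KeyError):
    """Raised instead of surfacing an attribute error from an absent record."""


class TextStore(object):
    def __init__(self, texts):
        self._texts = texts

    def get_record_stream(self, keys):
        # many-at-once API: yields (key, text-or-None) for each request
        for key in keys:
            yield key, self._texts.get(key)

    def get_text(self, key):
        # single-text convenience: the stream plumbing and the
        # absent-record check live in exactly one place
        for k, text in self.get_record_stream([key]):
            if text is None:
                raise NoSuchText(key)
            return text


store = TextStore({('rev-1',): 'hello\n'})
print(store.get_text(('rev-1',)))
```

Callers who need one text get a one-liner and a meaningful exception, while bulk callers still use the stream API directly.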
John
=:->
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal
On Wed, 2009-06-03 at 13:36 +0000, John A Meinel wrote:
Meta: I'm really confused vis-a-vis reviews and blocking. All I've done
here is *ask* your opinion on reusing insert_record_stream, and offer
answers to some of the technical issues you see with that. I haven't set
a review status of veto or resubmit - and I don't think I've signalled
in anyway that I would. So I don't know why you're feeling blocked.
> > I think we're at the point of maturity in bzr that it makes sense to
> > spend a small amount of time saying 'whats the cleanest way to do X',
> > and then talk about how to get there.
> >
> > At the moment, expanding VF's API doesn't seem desirable, or the best
> > way to be tackling the problem. I think there should be precisely one
> > way to add texts to a VF, and that should be as small and fast as we can
> > make it.
> >
> > -Rob
> >
>
> We're also blocking on a fairly significant win *today* because of a
> potential desire to rewrite a lot of code to make something slightly
> cleaner. (Which is something that has been a misfeature of the bzr
> project for a *long* time.)
I think we often ask the question - and thats important. Sometimes the
answer is 'yes we should fix the deep issue' and sometimes its 'lets do
it with the least possible changes'. Some things do get stuck, and thats
a shame - I've had that happen to concepts I've proposed, and seen it
happen to other peoples ideas.
> I'm not saying we shouldn't do this, I'm just pointing out the issue.
>
> *For now* I don't feel like rewriting the entire insert_record_stream
> stack just to get this in. So I'll leave this pending for now. (More
> important is to actually get GC stacking working over bzr+ssh, etc.)
I think it would be a good idea to make the new method private then,
because of the open question hanging over it.
> I'm also not sure that getting rid of the "add_this_text" api is
> really a net win. Having to write code like:
> vf.get_record_stream([key], 'unordered', True).next().get_bytes_as('fulltext')
>
> just to get a single text out is ugly. Not to mention prone to raising
> bad exceptions like "AbsentContentFactory has no attribute
> 'get_bytes_as'", rather than something sane like "NoSuchRevision".
> Having to do the same thing during *insert* is just as ugly.
And yet, single read/single write methods are terrible for networking,
and commit over the network is something we currently support - but
can't make even vaguely fast until commit no longer uses add_text_*. With
respect to exceptions, we actually do want different exceptions at
different places, so I think it has on balance cleaned some stuff up, in
fact.
> I know you wanted to push people towards multi requests, and I
> understand why. I'm not sure that completely removing the convenience
> functions is a complete solution, though.
I'd like us to get to the point where the core code doesn't do network
hostile things. Beyond that - well, I'm ok if plugins and library users
want to shoot themselves in the foot.
-Rob
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal
Robert Collins wrote:
> On Wed, 2009-06-03 at 13:36 +0000, John A Meinel wrote:
>
>
>> We're also blocking on a fairly significant win *today* because of a
>> potential desire to rewrite a lot of code to make something slightly
>> cleaner. (Which is something that has been a misfeature of the bzr
>> project for a *long* time.)
>>
I agree this is a problem that we need to sort out. I occasionally put
and leave useful code in plugins simply because it can take weeks of
effort/debate to get APIs extended in bzrlib. If it only takes a few
hours to write the methods in the first place, it's more productive for
me to just leave the code out of the core and cut-and-paste it when I
need it again.
> I think we often ask the question - and thats important. Sometimes the
> answer is 'yes we should fix the deep issue' and sometimes its 'lets do
> it with the least possible changes'. Some things do get stuck, and thats
> a shame - I've had that happen to concepts I've proposed, and seen it
> happen to other peoples ideas.
>
>
I agree it's really important to ask the questions. That's the whole
point of reviews.
>> *For now* I don't feel like rewriting the entire insert_record_stream
>> stack just to get this in. So I'll leave this pending for now. (More
>> important is to actually get GC stacking working over bzr+ssh, etc.)
>>
>
> I think it would be a good idea to make the new method private then,
> because of the open question hanging over it.
>
>
That sounds like a reasonable compromise. The other way to look at the
problem though is this:
"Is this new API a step forward with medium-to-long term value?"
> I'd like us to get to the point where the core code doesn't do network
> hostile things. Beyond that - well, I'm ok if plugins and library users
> want to shoot themselves in the foot.
>
Right. But there are genuine use cases for having easy-to-use,
appropriate-to-the-task APIs. I'm fine with
having such APIs *provided* the docstrings point the reader to more
network-friendly alternatives.
FWIW, if John's proposed API is faster than the current commonly-used
one, then it sounds like a one-or-two line change to fast-import for me
to take advantage of it. I appreciate that you want fast-import moving
towards using CommitBuilder instead of its own CommitImporter class but
that's a much bigger change (and it's some time away).
Ian C.
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal
On Thu, 2009-06-04 at 04:09 +0000, Ian Clatworthy wrote:
> Robert Collins wrote:
> > On Wed, 2009-06-03 at 13:36 +0000, John A Meinel wrote:
> >
> >
> >> We're also blocking on a fairly significant win *today* because of a
> >> potential desire to rewrite a lot of code to make something slightly
> >> cleaner. (Which is something that has been a misfeature of the bzr
> >> project for a *long* time.)
> >>
>
> I agree this is a problem that we need to sort out. I occasionally put
> and leave useful code in plugins simply because it can take weeks of
> effort/debate to get APIs extended in bzrlib. If it only takes a few
> hours to write the methods in the first place, it's more productive for
> me to just leave the code out of the core and cut-and-paste it when I
> need it again.
We don't have a good place for experiments ' in core'. And one possible
answer is that we don't need one - thats what we have plugins for. For
instance, I note that your revno cache got rewritten to be significantly
different as you learnt more about the problem. I think this is healthy,
as long as you don't get blocked.
> >> *For now* I don't feel like rewriting the entire insert_record_stream
> >> stack just to get this in. So I'll leave this pending for now. (More
> >> important is to actually get GC stacking working over bzr+ssh, etc.)
> >>
> >
> > I think it would be a good idea to make the new method private then,
> > because of the open question hanging over it.
> >
> >
>
> That sounds like a reasonable compromise. The other way to look at the
> problem though is this:
>
> "Is this new API a step forward with medium-to-long term value?"
I think thats what the design aspect of the review seeks to answer; but
its often hard to tell.
> > I'd like us to get to the point where the core code doesn't do network
> > hostile things. Beyond that - well, I'm ok if plugins and library users
> > want to shoot themselves in the foot.
> >
>
> Right. But there are genuine use cases for having easy-to-use,
> appropriate-
> having such APIs *provided* the docstrings point the reader to more
> network-friendly alternatives.
In this particular case I'd like to have them as adapters; such as
versionedfile.add_text(versioned_files, ...):
    for details in \
            versioned_files.insert_record_stream(...):
        return details
or whatever. That would separate them cleanly from the core API, prevent
them varying per implementation (easing testing) and make them not the
default way of working.
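Robert's quoted pseudo-code is truncated, but the adapter shape can be filled in under assumed names: single-text insertion lives in one module-level function over the stream API, so it cannot vary per implementation. A minimal sketch:

```python
class Record(object):
    # illustrative fulltext record, not a bzrlib class
    def __init__(self, key, parents, text):
        self.key, self.parents, self.text = key, parents, text


class VersionedFilesStub(object):
    # minimal stand-in exposing only the one core insertion API
    def __init__(self):
        self._texts = {}

    def insert_record_stream(self, stream):
        for record in stream:
            self._texts[record.key] = record.text
            yield (record.key, len(record.text))


def add_text(versioned_files, key, parents, text):
    # module-level adapter: wrap one text in a one-element stream, so
    # every implementation shares this code path and only the stream
    # API is implementation-specific (which also eases testing)
    for details in versioned_files.insert_record_stream(
            [Record(key, parents, text)]):
        return details


vf = VersionedFilesStub()
print(add_text(vf, ('f1',), (), 'abc'))
```

Because the adapter is a free function, it is clearly separated from the core surface area and is not the default way of working, which is the property Robert is after.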
> FWIW, if John's proposed API is faster than the current commonly-used
> one, then it sounds like a one-or-two line change to fast-import for me
> to take advantage of it. I appreciate that you want fast-import moving
> towards using CommitBuilder instead of it's own CommitImporter class but
> that's a much bigger change (and it's some time away).
I think it would be fine to use a private method in fast-import:
fast-import is trying for maximum speed, and you are keeping a close eye
on it.
-Rob
John A Meinel (jameinel) wrote :
This patch builds on the previous one, and changes things to:
1) VF.add_text becomes VF._add_text
2) Simplifies the api a lot, since it is meant to be used by a limited crowd. So I got rid of the parent_texts parameter, etc. 'commit' never used it, and neither do CHK repositories. So it doesn't really make sense to carry it around as legacy.
This is mostly about making 'commit' hold less data in memory. I think peak usage is right around 3x the size of a large file (depending on text/binary, long/short lines, no eol, etc.) We are still too high for memory on GC repos, but that is the delta index code, and my other patch handles that.
I haven't really performance tested this by itself, because not computing the delta index makes such a large impact.
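The ~3x multiple John cites can be made concrete with a toy model (payload bytes only, ignoring Python object overhead): during a line-based add, the fulltext, a split-into-lines copy, and a re-joined copy can all be alive at the same time.

```python
# Hypothetical model of the ~3x memory multiple for a line-based add:
# three full copies of the content coexist at the peak.
line = b'x' * 99 + b'\n'
text = line * 100000                  # ~10MB "fulltext" read from disk
lines = text.splitlines(True)         # second copy: split into lines
joined = b''.join(lines)              # third copy: re-joined for sha/gzip

payload = len(text) + sum(len(l) for l in lines) + len(joined)
assert payload == 3 * len(text)       # three full copies of the content
print('%d MB held for a %d MB file' %
      (payload // 2**20, len(text) // 2**20))
```

Passing the fulltext straight through (as _add_text does) avoids the split and re-join copies entirely.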
John A Meinel (jameinel) wrote :
...
> I haven't really performance tested this by itself, because not computing the
> delta index makes such a large impact.
For committing a single 90MB file (short lines, no final eol), this changes

  1.9   469MB  5.7s
  dev6  590MB  7.4s

to

  1.9   361MB  5.0s
  dev6  348MB  5.6s

with the delta-index patch also proposed, this goes further to:

  dev6  216MB  4.0s
John A Meinel (jameinel) wrote :
To give a slightly better summary, and include results for a 'wide' tree, not just a single large file.

single 90MB file:
        old          no-delta-index  commit-fulltext
  1.9   496MB  5.7s  470MB  5.7s     361MB  5.0s
  dev6  590MB  7.4s  457MB  5.8s     216MB  4.0s

initial commit of mysql (140 MB of texts):
        old          no-delta-index  commit-fulltext
  1.9   44MB  12.3s  45MB  12.1s     45MB  11.5s
  dev6  52MB  16.1s  48MB  15.6s     47MB  14.2s

Getting away from line-based processing improves speed by ~10% (15.6s => 14.2s = 1.4s),
on top of the ~10% from not computing the delta index.
The larger the files, the larger the impact, but it does, indeed, have a measurable
effect on initial commit times.
Andrew Bennetts (spiv) wrote :
I have a couple of comments, but basically:
vote approve
John A Meinel wrote:
[...]
> + def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
> + """See VersionedFiles.add_text()."""
> + self._index._check_write_ok()
> + self._check_add(key, None, random_id, check_content=False)
> + if text.__class__ is not str:
> + raise errors.BzrBadParameterUnicode("text")
Well... 'text' might not be unicode either ;)
(I'm not really concerned, but ideally perhaps that exception would take the
param value or type as well as param name...)
[...]
> @@ -1597,7 +1615,7 @@
> if refs:
> for ref in refs:
> if ref:
> - raise KnitCorrupt(self,
> + raise errors.KnitCorrupt(self,
Thanks for fixing this. pyflakes reports two other undefined names in this
file:
bzrlib/
bzrlib/
Now would be a good time to fix those too :)
> === modified file 'bzrlib/knit.py'
[...]
> + line_bytes = ''.join(lines)
> return self._add(key, lines, parents,
> - parent_texts, left_matching_blocks, nostore_sha, random_id)
> + parent_texts, left_matching_blocks, nostore_sha, random_id,
> + line_bytes=line_bytes)
You pass lines *and* line_bytes? That looks odd. I'm pretty sure it's
intentional, but it's worth pointing this out and explaining the rationale
in _add's docstring, I think.
> + def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
> + """See VersionedFiles.add_text()."""
> + self._index._check_write_ok()
> + self._check_add(key, None, random_id, check_content=False)
> + if text.__class__ is not str:
> + raise errors.BzrBadParameterUnicode("text")
> + if parents is None:
> + # The caller might pass None if there is no graph data, but kndx
> + # indexes can't directly store that, so we give them
> + # an empty tuple instead.
> + parents = ()
> + return self._add(key, None, parents,
> + None, None, nostore_sha, random_id,
> + line_bytes=text)
Hmm. It would be nice if this wasn't largely the same as _add_text in
groupcompress.py.
[...]
> - if lines:
> - if lines[-1][-1] != '\n':
> - # copy the contents of lines.
> + no_eol = False
> + # Note: line_bytes is not modified to add a newline, that is tracked
> + # via the no_eol flag. 'lines' *is* modified, because that is the
> + # general values needed by the Content code.
Gosh I hate our no_eol flag... :)
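The no-eol handling under review can be restated as a standalone sketch: the joined bytes are never modified, while the lines list is normalized so its last entry ends in '\n', with a flag recording that the trailing newline is synthetic. (split_lines here mimics bzrlib's osutils.split_lines; the function name is mine.)

```python
def split_lines(text):
    # keepends split, like bzrlib's osutils.split_lines
    return text.splitlines(True)


def normalize_eol(line_bytes, lines=None):
    """Return (lines, no_eol): lines always newline-terminated,
    line_bytes left untouched."""
    no_eol = False
    if line_bytes and line_bytes[-1] != '\n':
        no_eol = True
        # copy the existing list, or create a new one
        lines = split_lines(line_bytes) if lines is None else lines[:]
        # replace the last line with one that ends in a final newline
        lines[-1] = lines[-1] + '\n'
    if lines is None:
        lines = split_lines(line_bytes)
    return lines, no_eol


print(normalize_eol('a\nb'))    # → (['a\n', 'b\n'], True)
print(normalize_eol('a\nb\n'))  # → (['a\n', 'b\n'], False)
```

Copying the list before mutating it is what keeps the caller's `lines` intact, which is the subtlety the diff's comment is warning about.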
[...]
> @@ -1920,21 +1949,16 @@
> function spends less time resizing the final string.
> :return: (len, a StringIO instance with the raw data ready to read.)
> """
> - # Note: using a string copy here increases memory pressure with e.g.
> - # ISO's, but it is about 3 seconds faster on a 1.2Ghz intel machine
> - # when doing the initial commit of a mozilla tree. RBC 20070921
John A Meinel (jameinel) wrote :
Andrew Bennetts wrote:
> Review: Approve
> I have a couple of comments, but basically:
>
> vote approve
>
> John A Meinel wrote:
> [...]
>> + def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
>> + """See VersionedFiles.add_text()."""
>> + self._index._check_write_ok()
>> + self._check_add(key, None, random_id, check_content=False)
>> + if text.__class__ is not str:
>> + raise errors.BzrBadParameterUnicode("text")
>
> Well... 'text' might not be unicode either ;)
>
> (I'm not really concerned, but ideally perhaps that exception would take the
> param value or type as well as param name...)
Sure. It was all that "_check_add" was doing, but _check_add does it by
iterating over the list of lines.
>
> [...]
>> @@ -1597,7 +1615,7 @@
>> if refs:
>> for ref in refs:
>> if ref:
>> - raise KnitCorrupt(self,
>> + raise errors.KnitCorrupt(self,
>
> Thanks for fixing this. pyflakes reports two other undefined names in this
> file:
>
> bzrlib/
> bzrlib/
>
> Now would be a good time to fix those too :)
Done
Though of course it would be better to have tests that provoke the
behavior.... :)
>> === modified file 'bzrlib/knit.py'
> [...]
>> + line_bytes = ''.join(lines)
>> return self._add(key, lines, parents,
>> - parent_texts, left_matching_blocks, nostore_sha, random_id)
>> + parent_texts, left_matching_blocks, nostore_sha, random_id,
>> + line_bytes=line_bytes)
>
> You pass lines *and* line_bytes? That looks odd. I'm pretty sure it's
> intentional, but it's worth pointing this out and explaining the rationale
> in _add's docstring, I think.
Sure. It is because there are places that want lines (Content objects
require them, and our "diff" code requires them as well.) and the commit
code would rather have line_bytes (because writing a single string to
the gzip code is quite a bit faster than a bunch of short ones).
And this way, we have both types accessible when we need them, rather
than "re-casting" from whatever we have into the other type, and then
multiplying memory *again*.
I'll doc it better.
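John's rationale (one big string into gzip beats many short ones) is easy to check with the standard zlib module as a rough stand-in for bzrlib's tuned_gzip: the compressed content is the same either way, only the number of calls crossing into zlib differs.

```python
import time
import zlib

lines = [('line %d\n' % i).encode() for i in range(50000)]
line_bytes = b''.join(lines)


def compress_chunks(chunks):
    # one zlib call per chunk handed in
    comp = zlib.compressobj()
    out = [comp.compress(chunk) for chunk in chunks]
    out.append(comp.flush())
    return b''.join(out)


t0 = time.time()
many = compress_chunks(lines)         # one call per line
t_many = time.time() - t0

t0 = time.time()
one = compress_chunks([line_bytes])   # single call with the joined bytes
t_one = time.time() - t0

# same decompressed content either way; only the call pattern differs
assert zlib.decompress(many) == zlib.decompress(one) == line_bytes
print('per-line: %.4fs  joined: %.4fs' % (t_many, t_one))
```

Absolute numbers vary by machine, but the per-line path pays Python-level loop and call overhead 50,000 times instead of once, which is the effect being exploited by passing line_bytes through.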
>
>> + def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
>> + """See VersionedFiles.add_text()."""
>> + self._index._check_write_ok()
>> + self._check_add(key, None, random_id, check_content=False)
>> + if text.__class__ is not str:
>> + raise errors.BzrBadParameterUnicode("text")
>> + if parents is None:
>> + # The caller might pass None if there is no graph data, but kndx
>> + # indexes can't directly store that, so we give them
>> + # an empty tuple instead.
>> + parents = ()
>> + return self._add(key, None, parents,
>> + None, None, nostore_sha, random_id,
>> + line_bytes=text)
>
> Hmm. It would be nice if this wasn't largely the same as _add_text in
> groupcompress.py.
Andrew Bennetts (spiv) wrote :
John A Meinel wrote:
> Andrew Bennetts wrote:
[...]
> > bzrlib/
> > bzrlib/
> >
> > Now would be a good time to fix those too :)
>
> Done
>
> Though of course it would be better to have tests that provoke the
> behavior.... :)
Be my guest ;)
I'm happy to take an obvious improvement though.
[...]
> > Hmm. It would be nice if this wasn't largely the same as _add_text in
> > groupcompress.py.
>
> All but the final "self._add" versus "record = FulltextContentFactory(...);
> self._insert_record_stream(...)".
>
> It didn't seem like quite enough duplication to worry about. If you
> really prefer I can refactor.
It's up to you. I can see that it would be a little awkward to refactor
(it's not obvious to me that the common logic really belongs in the
VersionedFiles base class), and I agree about not “quite enough” to require
it.
So, “it would be nice”, but that's all :)
> >> === modified file 'bzrlib/
> > [...]
> >
> > The conditionals and repetition here are a bit ugly, but it'll do. Ideally
> > this would be done with the proper test parameterisation tools, though.
[...]
> So it is at least better, even if it isn't perfect.
>
> I suppose we could add a test permutation that tried to turn all calls
> to "add_lines" into "_add_text" and run against those. I don't feel it
> is quite worth that.
No, it probably isn't worth it. Thanks for taking a look.
-Andrew.
Preview Diff
=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py	2009-06-10 03:56:49 +0000
+++ bzrlib/groupcompress.py	2009-06-16 02:36:16 +0000
@@ -1008,6 +1008,24 @@
                 nostore_sha=nostore_sha))[0]
         return sha1, length, None
 
+    def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
+        """See VersionedFiles.add_text()."""
+        self._index._check_write_ok()
+        self._check_add(key, None, random_id, check_content=False)
+        if text.__class__ is not str:
+            raise errors.BzrBadParameterUnicode("text")
+        if parents is None:
+            # The caller might pass None if there is no graph data, but kndx
+            # indexes can't directly store that, so we give them
+            # an empty tuple instead.
+            parents = ()
+        # double handling for now. Make it work until then.
+        length = len(text)
+        record = FulltextContentFactory(key, parents, None, text)
+        sha1 = list(self._insert_record_stream([record], random_id=random_id,
+                                               nostore_sha=nostore_sha))[0]
+        return sha1, length, None
+
     def add_fallback_versioned_files(self, a_versioned_files):
         """Add a source of texts for texts not present in this knit.
 
@@ -1613,7 +1631,7 @@
         if refs:
             for ref in refs:
                 if ref:
-                    raise KnitCorrupt(self,
+                    raise errors.KnitCorrupt(self,
                         "attempt to add node with parents "
                         "in parentless index.")
         refs = ()
 
=== modified file 'bzrlib/knit.py'
--- bzrlib/knit.py	2009-06-10 03:56:49 +0000
+++ bzrlib/knit.py	2009-06-16 02:36:16 +0000
@@ -909,18 +909,35 @@
             # indexes can't directly store that, so we give them
             # an empty tuple instead.
             parents = ()
+        line_bytes = ''.join(lines)
         return self._add(key, lines, parents,
-            parent_texts, left_matching_blocks, nostore_sha, random_id)
+            parent_texts, left_matching_blocks, nostore_sha, random_id,
+            line_bytes=line_bytes)
+
+    def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
+        """See VersionedFiles.add_text()."""
+        self._index._check_write_ok()
+        self._check_add(key, None, random_id, check_content=False)
+        if text.__class__ is not str:
+            raise errors.BzrBadParameterUnicode("text")
+        if parents is None:
+            # The caller might pass None if there is no graph data, but kndx
+            # indexes can't directly store that, so we give them
+            # an empty tuple instead.
+            parents = ()
+        return self._add(key, None, parents,
+            None, None, nostore_sha, random_id,
+            line_bytes=text)
 
     def _add(self, key, lines, parents, parent_texts,
-        left_matching_blocks, nostore_sha, random_id):
+        left_matching_blocks, nostore_sha, random_id,
+        line_bytes):
         """Add a set of lines on top of version specified by parents.
 
         Any versions not present will be converted into ghosts.
         """
         # first thing, if the content is something we don't need to store, find
         # that out.
-        line_bytes = ''.join(lines)
         digest = sha_string(line_bytes)
         if nostore_sha == digest:
             raise errors.ExistingContent
@@ -947,13 +964,22 @@
 
         text_length = len(line_bytes)
         options = []
-        if lines:
-            if lines[-1][-1] != '\n':
-                # copy the contents of lines.
-                lines = lines[:]
-                options.append('no-eol')
-                lines[-1] = lines[-1] + '\n'
-                line_bytes += '\n'
+        no_eol = False
+        # Note: line_bytes is not modified to add a newline, that is tracked
+        # via the no_eol flag. 'lines' *is* modified, because that is the
+        # general values needed by the Content code.
+        if line_bytes and line_bytes[-1] != '\n':
+            options.append('no-eol')
+            no_eol = True
+            # Copy the existing list, or create a new one
+            if lines is None:
+                lines = osutils.split_lines(line_bytes)
+            else:
+                lines = lines[:]
+            # Replace the last line with one that ends in a final newline
+            lines[-1] = lines[-1] + '\n'
+        if lines is None:
+            lines = osutils.split_lines(line_bytes)
 
         for element in key[:-1]:
             if type(element) != str:
@@ -965,7 +991,7 @@
         # Knit hunks are still last-element only
         version_id = key[-1]
         content = self._factory.make(lines, version_id)
-        if 'no-eol' in options:
+        if no_eol:
             # Hint to the content object that its text() call should strip the
             # EOL.
             content._should_strip_eol = True
@@ -986,8 +1012,11 @@
         if self._factory.__class__ is KnitPlainFactory:
             # Use the already joined bytes saving iteration time in
             # _record_to_data.
+            dense_lines = [line_bytes]
+            if no_eol:
+                dense_lines.append('\n')
             size, bytes = self._record_to_data(key, digest,
-                lines, [line_bytes])
+                lines, dense_lines)
         else:
             # get mixed annotation + content and feed it into the
             # serialiser.
@@ -1920,21 +1949,16 @@
         function spends less time resizing the final string.
         :return: (len, a StringIO instance with the raw data ready to read.)
         """
-        # Note: using a string copy here increases memory pressure with e.g.
-        # ISO's, but it is about 3 seconds faster on a 1.2Ghz intel machine
-        # when doing the initial commit of a mozilla tree. RBC 20070921
-        bytes = ''.join(chain(
-            ["version %s %d %s\n" % (key[-1],
-                                     len(lines),
-                                     digest)],
-            dense_lines or lines,
-            ["end %s\n" % key[-1]]))
-        if type(bytes) != str:
-            raise AssertionError(
-                'data must be plain bytes was %s' % type(bytes))
+        chunks = ["version %s %d %s\n" % (key[-1], len(lines), digest)]
+        chunks.extend(dense_lines or lines)
+        chunks.append("end %s\n" % key[-1])
+        for chunk in chunks:
+            if type(chunk) != str:
+                raise AssertionError(
+                    'data must be plain bytes was %s' % type(chunk))
         if lines and lines[-1][-1] != '\n':
             raise ValueError('corrupt lines value %r' % lines)
-        compressed_bytes = tuned_gzip.bytes_to_gzip(bytes)
+        compressed_bytes = tuned_gzip.chunks_to_gzip(chunks)
         return len(compressed_bytes), compressed_bytes
 
     def _split_header(self, line):
162 | 1941 | 1965 | ||
=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py	2009-06-12 01:11:00 +0000
+++ bzrlib/repository.py	2009-06-16 02:36:16 +0000
@@ -494,12 +494,12 @@
             ie.executable = content_summary[2]
             file_obj, stat_value = tree.get_file_with_stat(ie.file_id, path)
             try:
-                lines = file_obj.readlines()
+                text = file_obj.read()
             finally:
                 file_obj.close()
             try:
                 ie.text_sha1, ie.text_size = self._add_text_to_weave(
-                    ie.file_id, lines, heads, nostore_sha)
+                    ie.file_id, text, heads, nostore_sha)
                 # Let the caller know we generated a stat fingerprint.
                 fingerprint = (ie.text_sha1, stat_value)
             except errors.ExistingContent:
@@ -517,8 +517,7 @@
                 # carry over:
                 ie.revision = parent_entry.revision
                 return self._get_delta(ie, basis_inv, path), False, None
-            lines = []
-            self._add_text_to_weave(ie.file_id, lines, heads, None)
+            self._add_text_to_weave(ie.file_id, '', heads, None)
         elif kind == 'symlink':
             current_link_target = content_summary[3]
             if not store:
@@ -532,8 +531,7 @@
                 ie.symlink_target = parent_entry.symlink_target
                 return self._get_delta(ie, basis_inv, path), False, None
             ie.symlink_target = current_link_target
-            lines = []
-            self._add_text_to_weave(ie.file_id, lines, heads, None)
+            self._add_text_to_weave(ie.file_id, '', heads, None)
         elif kind == 'tree-reference':
             if not store:
                 if content_summary[3] != parent_entry.reference_revision:
@@ -544,8 +542,7 @@
                 ie.revision = parent_entry.revision
                 return self._get_delta(ie, basis_inv, path), False, None
             ie.reference_revision = content_summary[3]
-            lines = []
-            self._add_text_to_weave(ie.file_id, lines, heads, None)
+            self._add_text_to_weave(ie.file_id, '', heads, None)
         else:
             raise NotImplementedError('unknown kind')
         ie.revision = self._new_revision_id
@@ -745,7 +742,7 @@
                     entry.executable = True
                 else:
                     entry.executable = False
-                if (carry_over_possible and
+                if (carry_over_possible and
                     parent_entry.executable == entry.executable):
                     # Check the file length, content hash after reading
                     # the file.
@@ -754,12 +751,12 @@
                     nostore_sha = None
                 file_obj, stat_value = tree.get_file_with_stat(file_id, change[1][1])
                 try:
-                    lines = file_obj.readlines()
+                    text = file_obj.read()
                 finally:
                     file_obj.close()
                 try:
                     entry.text_sha1, entry.text_size = self._add_text_to_weave(
-                        file_id, lines, heads, nostore_sha)
+                        file_id, text, heads, nostore_sha)
                     yield file_id, change[1][1], (entry.text_sha1, stat_value)
                 except errors.ExistingContent:
                     # No content change against a carry_over parent
@@ -774,7 +771,7 @@
                     parent_entry.symlink_target == entry.symlink_target):
                     carried_over = True
                 else:
-                    self._add_text_to_weave(change[0], [], heads, None)
+                    self._add_text_to_weave(change[0], '', heads, None)
             elif kind == 'directory':
                 if carry_over_possible:
                     carried_over = True
@@ -782,7 +779,7 @@
                     # Nothing to set on the entry.
                     # XXX: split into the Root and nonRoot versions.
                     if change[1][1] != '' or self.repository.supports_rich_root():
-                        self._add_text_to_weave(change[0], [], heads, None)
+                        self._add_text_to_weave(change[0], '', heads, None)
             elif kind == 'tree-reference':
                 if not self.repository._format.supports_tree_reference:
                     # This isn't quite sane as an error, but we shouldn't
@@ -797,7 +794,7 @@
                     parent_entry.reference_revision == reference_revision):
                     carried_over = True
                 else:
-                    self._add_text_to_weave(change[0], [], heads, None)
+                    self._add_text_to_weave(change[0], '', heads, None)
             else:
                 raise AssertionError('unknown kind %r' % kind)
             if not carried_over:
@@ -818,17 +815,11 @@
         self._require_root_change(tree)
         self.basis_delta_revision = basis_revision_id
 
-    def _add_text_to_weave(self, file_id, new_lines, parents, nostore_sha):
-        # Note: as we read the content directly from the tree, we know its not
-        # been turned into unicode or badly split - but a broken tree
-        # implementation could give us bad output from readlines() so this is
-        # not a guarantee of safety. What would be better is always checking
-        # the content during test suite execution. RBC 20070912
-        parent_keys = tuple((file_id, parent) for parent in parents)
-        return self.repository.texts.add_lines(
-            (file_id, self._new_revision_id), parent_keys, new_lines,
-            nostore_sha=nostore_sha, random_id=self.random_revid,
-            check_content=False)[0:2]
+    def _add_text_to_weave(self, file_id, new_text, parents, nostore_sha):
+        parent_keys = tuple([(file_id, parent) for parent in parents])
+        return self.repository.texts._add_text(
+            (file_id, self._new_revision_id), parent_keys, new_text,
+            nostore_sha=nostore_sha, random_id=self.random_revid)[0:2]
 
 
 class RootCommitBuilder(CommitBuilder):
 
=== modified file 'bzrlib/tests/test_tuned_gzip.py'
--- bzrlib/tests/test_tuned_gzip.py	2009-03-23 14:59:43 +0000
+++ bzrlib/tests/test_tuned_gzip.py	2009-06-16 02:36:17 +0000
@@ -85,3 +85,28 @@
         self.assertEqual('', stream.read())
         # and it should be new member time in the stream.
         self.failUnless(myfile._new_member)
+
+
+class TestToGzip(TestCase):
+
+    def assertToGzip(self, chunks):
+        bytes = ''.join(chunks)
+        gzfromchunks = tuned_gzip.chunks_to_gzip(chunks)
+        gzfrombytes = tuned_gzip.bytes_to_gzip(bytes)
+        self.assertEqual(gzfrombytes, gzfromchunks)
+        decoded = tuned_gzip.GzipFile(fileobj=StringIO(gzfromchunks)).read()
+        self.assertEqual(bytes, decoded)
+
+    def test_single_chunk(self):
+        self.assertToGzip(['a modest chunk\nwith some various\nbits\n'])
+
+    def test_simple_text(self):
+        self.assertToGzip(['some\n', 'strings\n', 'to\n', 'process\n'])
+
+    def test_large_chunks(self):
+        self.assertToGzip(['a large string\n'*1024])
+        self.assertToGzip(['a large string\n']*1024)
+
+    def test_enormous_chunks(self):
+        self.assertToGzip(['a large string\n'*1024*256])
+        self.assertToGzip(['a large string\n']*1024*256)
 
=== modified file 'bzrlib/tests/test_versionedfile.py'
--- bzrlib/tests/test_versionedfile.py	2009-05-01 18:09:24 +0000
+++ bzrlib/tests/test_versionedfile.py	2009-06-16 02:36:17 +0000
@@ -1471,6 +1471,58 @@
         self.addCleanup(lambda:self.cleanup(files))
         return files
 
+    def test_add_lines(self):
+        f = self.get_versionedfiles()
+        if self.key_length == 1:
+            key0 = ('r0',)
+            key1 = ('r1',)
+            key2 = ('r2',)
+            keyf = ('foo',)
+        else:
+            key0 = ('fid', 'r0')
+            key1 = ('fid', 'r1')
+            key2 = ('fid', 'r2')
+            keyf = ('fid', 'foo')
+        f.add_lines(key0, [], ['a\n', 'b\n'])
+        if self.graph:
+            f.add_lines(key1, [key0], ['b\n', 'c\n'])
+        else:
+            f.add_lines(key1, [], ['b\n', 'c\n'])
+        keys = f.keys()
+        self.assertTrue(key0 in keys)
+        self.assertTrue(key1 in keys)
+        records = []
+        for record in f.get_record_stream([key0, key1], 'unordered', True):
+            records.append((record.key, record.get_bytes_as('fulltext')))
+        records.sort()
+        self.assertEqual([(key0, 'a\nb\n'), (key1, 'b\nc\n')], records)
+
+    def test__add_text(self):
+        f = self.get_versionedfiles()
+        if self.key_length == 1:
+            key0 = ('r0',)
+            key1 = ('r1',)
+            key2 = ('r2',)
+            keyf = ('foo',)
+        else:
+            key0 = ('fid', 'r0')
+            key1 = ('fid', 'r1')
+            key2 = ('fid', 'r2')
+            keyf = ('fid', 'foo')
+        f._add_text(key0, [], 'a\nb\n')
+        if self.graph:
+            f._add_text(key1, [key0], 'b\nc\n')
+        else:
+            f._add_text(key1, [], 'b\nc\n')
+        keys = f.keys()
+        self.assertTrue(key0 in keys)
+        self.assertTrue(key1 in keys)
+        records = []
+        for record in f.get_record_stream([key0, key1], 'unordered', True):
+            records.append((record.key, record.get_bytes_as('fulltext')))
+        records.sort()
+        self.assertEqual([(key0, 'a\nb\n'), (key1, 'b\nc\n')], records)
+
     def test_annotate(self):
         files = self.get_versionedfiles()
         self.get_diamond_files(files)
@@ -1520,7 +1572,7 @@
             trailing_eol=trailing_eol, nograph=not self.graph,
             left_only=left_only, nokeys=nokeys)
 
-    def test_add_lines_nostoresha(self):
+    def _add_content_nostoresha(self, add_lines):
         """When nostore_sha is supplied using old content raises."""
         vf = self.get_versionedfiles()
         empty_text = ('a', [])
@@ -1528,7 +1580,12 @@
         sample_text_no_nl = ('c', ["foo\n", "bar"])
         shas = []
         for version, lines in (empty_text, sample_text_nl, sample_text_no_nl):
-            sha, _, _ = vf.add_lines(self.get_simple_key(version), [], lines)
+            if add_lines:
+                sha, _, _ = vf.add_lines(self.get_simple_key(version), [],
+                                         lines)
+            else:
+                sha, _, _ = vf._add_text(self.get_simple_key(version), [],
+                                         ''.join(lines))
             shas.append(sha)
         # we now have a copy of all the lines in the vf.
         for sha, (version, lines) in zip(
@@ -1537,10 +1594,19 @@
             self.assertRaises(errors.ExistingContent,
                 vf.add_lines, new_key, [], lines,
                 nostore_sha=sha)
+            self.assertRaises(errors.ExistingContent,
+                vf._add_text, new_key, [], ''.join(lines),
+                nostore_sha=sha)
             # and no new version should have been added.
            record = vf.get_record_stream([new_key], 'unordered', True).next()
            self.assertEqual('absent', record.storage_kind)
 
+    def test_add_lines_nostoresha(self):
+        self._add_content_nostoresha(add_lines=True)
+
+    def test__add_text_nostoresha(self):
+        self._add_content_nostoresha(add_lines=False)
+
     def test_add_lines_return(self):
         files = self.get_versionedfiles()
         # save code by using the stock data insertion helper.
 
=== modified file 'bzrlib/tuned_gzip.py'
--- bzrlib/tuned_gzip.py	2009-03-23 14:59:43 +0000
+++ bzrlib/tuned_gzip.py	2009-06-16 02:36:16 +0000
@@ -52,6 +52,18 @@
                   width=-zlib.MAX_WBITS, mem=zlib.DEF_MEM_LEVEL,
                   crc32=zlib.crc32):
     """Create a gzip file containing bytes and return its content."""
+    return chunks_to_gzip([bytes])
+
+
+def chunks_to_gzip(chunks, factory=zlib.compressobj,
+    level=zlib.Z_DEFAULT_COMPRESSION, method=zlib.DEFLATED,
+    width=-zlib.MAX_WBITS, mem=zlib.DEF_MEM_LEVEL,
+    crc32=zlib.crc32):
+    """Create a gzip file containing chunks and return its content.
+
+    :param chunks: An iterable of strings. Each string can have arbitrary
+        layout.
+    """
     result = [
         '\037\213'  # self.fileobj.write('\037\213')  # magic header
         '\010'      # self.fileobj.write('\010')      # compression method
@@ -69,11 +81,17 @@
     # using a compressobj avoids a small header and trailer that the compress()
     # utility function adds.
     compress = factory(level, method, width, mem, 0)
-    result.append(compress.compress(bytes))
+    crc = 0
+    total_len = 0
+    for chunk in chunks:
+        crc = crc32(chunk, crc)
+        total_len += len(chunk)
+        zbytes = compress.compress(chunk)
+        if zbytes:
+            result.append(zbytes)
     result.append(compress.flush())
-    result.append(struct.pack("<L", LOWU32(crc32(bytes))))
     # size may exceed 2GB, or even 4GB
-    result.append(struct.pack("<L", LOWU32(len(bytes))))
+    result.append(struct.pack("<LL", LOWU32(crc), LOWU32(total_len)))
     return ''.join(result)
 
 
=== modified file 'bzrlib/versionedfile.py'
--- bzrlib/versionedfile.py	2009-06-10 03:56:49 +0000
+++ bzrlib/versionedfile.py	2009-06-16 02:36:17 +0000
@@ -829,6 +829,36 @@
         """
         raise NotImplementedError(self.add_lines)
 
+    def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
+        """Add a text to the store.
+
+        This is a private function for use by CommitBuilder.
+
+        :param key: The key tuple of the text to add. If the last element is
+            None, a CHK string will be generated during the addition.
+        :param parents: The parents key tuples of the text to add.
+        :param text: A string containing the text to be committed.
+        :param nostore_sha: Raise ExistingContent and do not add the lines to
+            the versioned file if the digest of the lines matches this.
+        :param random_id: If True a random id has been selected rather than
+            an id determined by some deterministic process such as a converter
+            from a foreign VCS. When True the backend may choose not to check
+            for uniqueness of the resulting key within the versioned file, so
+            this should only be done when the result is expected to be unique
+            anyway.
+        :param check_content: If True, the lines supplied are verified to be
+            bytestrings that are correctly formed lines.
+        :return: The text sha1, the number of bytes in the text, and an opaque
+            representation of the inserted version which can be provided
+            back to future _add_text calls in the parent_texts dictionary.
+        """
+        # The default implementation just thunks over to .add_lines(),
+        # inefficient, but it works.
+        return self.add_lines(key, parents, osutils.split_lines(text),
+                              nostore_sha=nostore_sha,
+                              random_id=random_id,
+                              check_content=True)
+
     def add_mpdiffs(self, records):
         """Add mpdiffs to this VersionedFile.
 
This branch adds a new API, VersionedFiles.add_text(). If people really want, I could change it to VF.add_chunks(), but add_text() fits what I needed and was expedient.
The main effect is to change 'bzr commit' to use file.read() rather than file.readlines(), and then to pass that on to VF.add_text() rather than VF.add_lines().
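The contract the new call has to honour can be sketched in a few lines of plain Python (a toy stand-in, not the bzrlib classes; `str.splitlines(True)` plays the role of bzrlib's `osutils.split_lines` here): add_text(key, parents, text) must store exactly what add_lines(key, parents, split_lines(text)) would, which is why the default implementation can simply thunk to add_lines() while Knit and GroupCompress override it to skip the intermediate line list.

```python
class ToyStore(object):
    """Records what add_lines() receives; stands in for a VersionedFiles."""

    def __init__(self):
        self.calls = []

    def add_lines(self, key, parents, lines):
        # Record the call, then return (sha1, length, opaque) like the
        # real API would; the sha1 is faked for this sketch.
        self.calls.append((key, parents, lines))
        return ('fake-sha1', sum(len(l) for l in lines), None)

    def add_text(self, key, parents, text):
        # Default behaviour: thunk to add_lines(), splitting the text into
        # lines that keep their trailing newlines. Real backends can
        # override this to avoid ever building the list of lines.
        return self.add_lines(key, parents, text.splitlines(True))

store = ToyStore()
store.add_text(('fid', 'r1'), [], 'a\nb\nno final newline')
print(store.calls[0][2])  # ['a\n', 'b\n', 'no final newline']
```

The point of the override hook is that a fulltext-oriented backend never needs the split at all; only the thunking default pays for it.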
It also spends a little time removing some of the code that was causing us to copy the in-memory structures repeatedly. It doesn't completely remove the need for a list of lines during a Knit commit, but it *does* remove the need during a --dev6 commit.
To test this, I created a 90MB file consisting mostly of 20-byte strings, with no final newline. I then did:
rm -rf .bzr; bzr init --format=X; bzr add; time bzr commit -Dmemory -m "bigfile"
For --pack-0.92:
pre 469,748 kB, 5.554s
post 360,836 kB, 4.789s
For --development6-rich-root:
pre 589,732 kB, 7.785s
post 348,796 kB, 5.803s
So it is both faster and smaller. Though I still need to explore why --dev6 isn't more memory-friendly. It seems to be because of the DeltaIndex structures that are part of GroupCompress blocks. It might be worthwhile to optimize for not creating those on the first insert into a new group. (For the 90MB file, it seems to allocate 97MB for the 'loose' index, and then packs that into a 134MB index that has empty slots.)
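For reference, the chunk-based gzip trick that lets the Knit path avoid joining everything into one big string can be sketched independently of bzrlib: feed each chunk through a raw-deflate compressobj while keeping a running crc32 and total length, then emit the gzip header and trailer by hand. A minimal stand-alone version of that idea (plain zlib, not the bzrlib code; bytes_to_gzip(b) then reduces to chunks_to_gzip([b])):

```python
import struct
import zlib

def chunks_to_gzip(chunks):
    # Minimal gzip writer: fixed header, raw-deflate body, CRC32+size trailer.
    # Chunks are compressed one at a time, so the input is never joined
    # into a single large string.
    header = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff'  # magic, deflate, OS=unknown
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,
                                  zlib.DEFLATED, -zlib.MAX_WBITS)
    result = [header]
    crc = 0
    total_len = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)       # running CRC over the raw bytes
        total_len += len(chunk)
        zbytes = compressor.compress(chunk)
        if zbytes:
            result.append(zbytes)
    result.append(compressor.flush())
    # Trailer: CRC32 and uncompressed size, both modulo 2**32.
    result.append(struct.pack('<LL', crc & 0xFFFFFFFF,
                              total_len & 0xFFFFFFFF))
    return b''.join(result)

chunks = [b'some\n', b'strings\n', b'to\n', b'process\n']
# wbits=16+MAX_WBITS tells zlib.decompress to expect a gzip wrapper.
assert zlib.decompress(chunks_to_gzip(chunks),
                       16 + zlib.MAX_WBITS) == b''.join(chunks)
```

Because no intermediate flush is issued, deflate produces the same output regardless of how the input is chunked, which is exactly the property the new TestToGzip tests assert (gzip-of-chunks equals gzip-of-joined-bytes).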