Bazaar

Merge lp:~spiv/bzr/inventory-delta into lp:~bzr/bzr/trunk-old

inventory-delta
Merge into trunk-old

Proposed by Andrew Bennetts on 2009-08-05

Status:

Merged

Merged at revision:

not available

Proposed branch:

lp:~spiv/bzr/inventory-delta

Merge into:

lp:~bzr/bzr/trunk-old

Diff against target:

2845 lines

To merge this branch:

bzr merge lp:~spiv/bzr/inventory-delta

Related bugs:

Bug #385826: push and pull with different formats extremely slow on network	Critical	Fix Released
Bug #410917: inter-format push to remote 2a branch does packing on local end	Medium	Fix Released

Link a bug report

Reviewer	Review Type	Date Requested	Status
John A Meinel		2009-08-05	Needs Fixing on 2009-08-06
Review via email: mp+9676@code.launchpad.net

This proposal supersedes a proposal from 2009-07-22.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-07-16: Posted in a previous version of this proposal

This is a pretty big patch. It does lots of things:

* adds new insert_stream and get_stream verbs
* adds de/serialization of inventory-delta records on the network
* fixes rich-root generation in StreamSource
* adds a bunch of new scenarios to per_interrepository tests
* fixes some 'pack already exist' bugs for packing a single GC pack (i.e. when
the new pack is already optimal).
* improves the inventory_delta module a little
* various miscellaneous fixes and new tests that are hopefully self-evident
* and, most controversially, removes InterDifferingSerializer.

From John's mail a while back there were a bunch of issues with removing IDS. I
think the outstanding ones are:

> 1) Incremental updates. IDS converts batches of 100 revs at a time,
> which also triggers autopacks at 1k revs. Streaming fetch is currently
> an all-or-nothing, which isn't appropriate (IMO) for conversions.
> Consider that conversion can take *days*, it is important to have
> something that can be stopped and resumed.
>
> 2) Also, auto-packing as you go avoids the case you ran into, where bzr
> bloats to 2.4GB before packing back to 25MB. We know the new format is
> even more sensitive to packing efficiency. Not to mention that a single
> big-stream generates a single large pack, it isn't directly obvious that
> we are being so inefficient.

i.e. performance concerns.

The streaming code is pretty similar in how it does the conversion now to the
way IDS did it, but probably still different enough that we will want to measure
the impact of this. I'm definitely concerned about case 2, the lack of packing
as you go, although perhaps the degree of bloat is reduced by using
semantic inventory-delta records?

The reason why I eventually deleted IDS was that it was just too burdensome to
keep two code paths alive, thoroughly tested, and correct. For instance, if we
simply reinstated IDS for local-only fetches then most of the test suite,
including the relevant interrepo tests, will only exercise IDS. Also, IDS
turned out to have a bug when used on a stacked repository that the extending
test suite in this branch revealed (I've forgotten the details, but can dig them
up if you like). It didn't seem worth the hassle of fixing IDS when I already
had a working implementation.

I'm certainly open to reinstating IDS if it's the most expedient way to have
reasonable local performance for upgrades, but I thought I'd try to be bold and
see if we could just live without the extra complexity. Maybe we can improve
performance of streaming rather than resurrect IDS?

-Andrew.

This is a pretty big patch.  It does lots of things:

* adds new insert_stream and get_stream verbs
 * adds de/serialization of inventory-delta records on the network
 * fixes rich-root generation in StreamSource
 * adds a bunch of new scenarios to per_interrepository tests
 * fixes some 'pack already exist' bugs for packing a single GC pack (i.e. when
   the new pack is already optimal).
 * improves the inventory_delta module a little
 * various miscellaneous fixes and new tests that are hopefully self-evident
 * and, most controversially, removes InterDifferingSerializer.

From John's mail a while back there were a bunch of issues with removing IDS.  I
think the outstanding ones are:

> 1) Incremental updates. IDS converts batches of 100 revs at a time,
> which also triggers autopacks at 1k revs. Streaming fetch is currently
> an all-or-nothing, which isn't appropriate (IMO) for conversions.
> Consider that conversion can take *days*, it is important to have
> something that can be stopped and resumed.
> 
> 2) Also, auto-packing as you go avoids the case you ran into, where bzr
> bloats to 2.4GB before packing back to 25MB. We know the new format is
> even more sensitive to packing efficiency. Not to mention that a single
> big-stream generates a single large pack, it isn't directly obvious that
> we are being so inefficient.

i.e. performance concerns.

The streaming code is pretty similar in how it does the conversion now to the
way IDS did it, but probably still different enough that we will want to measure
the impact of this.  I'm definitely concerned about case 2, the lack of packing
as you go, although perhaps the degree of bloat is reduced by using
semantic inventory-delta records?

The reason why I eventually deleted IDS was that it was just too burdensome to
keep two code paths alive, thoroughly tested, and correct.  For instance, if we
simply reinstated IDS for local-only fetches then most of the test suite,
including the relevant interrepo tests, will only exercise IDS.  Also, IDS
turned out to have a bug when used on a stacked repository that the extending
test suite in this branch revealed (I've forgotten the details, but can dig them
up if you like).  It didn't seem worth the hassle of fixing IDS when I already
had a working implementation.

I'm certainly open to reinstating IDS if it's the most expedient way to have
reasonable local performance for upgrades, but I thought I'd try to be bold and
see if we could just live without the extra complexity.  Maybe we can improve
performance of streaming rather than resurrect IDS?

-Andrew.

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-16: Posted in a previous version of this proposal

Download full text (4.5 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/inventory-delta into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This is a pretty big patch. It does lots of things:
>
> * adds new insert_stream and get_stream verbs
> * adds de/serialization of inventory-delta records on the network
> * fixes rich-root generation in StreamSource
> * adds a bunch of new scenarios to per_interrepository tests
> * fixes some 'pack already exist' bugs for packing a single GC pack (i.e. when
> the new pack is already optimal).
> * improves the inventory_delta module a little
> * various miscellaneous fixes and new tests that are hopefully self-evident
> * and, most controversially, removes InterDifferingSerializer.
>
>>From John's mail a while back there were a bunch of issues with removing IDS. I
> think the outstanding ones are:
>
>> 1) Incremental updates. IDS converts batches of 100 revs at a time,
>> which also triggers autopacks at 1k revs. Streaming fetch is currently
>> an all-or-nothing, which isn't appropriate (IMO) for conversions.
>> Consider that conversion can take *days*, it is important to have
>> something that can be stopped and resumed.

It also picks out the 'optimal' deltas by computing many different ones
and finding whichever one was the 'smallest'. For local conversions, the
time to compute 2-3 deltas was much smaller than to apply an inefficient
delta.

>>
>> 2) Also, auto-packing as you go avoids the case you ran into, where bzr
>> bloats to 2.4GB before packing back to 25MB. We know the new format is
>> even more sensitive to packing efficiency. Not to mention that a single
>> big-stream generates a single large pack, it isn't directly obvious that
>> we are being so inefficient.
>
> i.e. performance concerns.
>

Generally, yes.

There is also:

3) Being able to resume because you snapshotted periodically as you
went. This seems even more important for a network transfer.

> The streaming code is pretty similar in how it does the conversion now to the
> way IDS did it, but probably still different enough that we will want to measure
> the impact of this. I'm definitely concerned about case 2, the lack of packing
> as you go, although perhaps the degree of bloat is reduced by using
> semantic inventory-delta records?
>

I don't think bzr bloating from 100MB => 2.4GB (and then back down to
25MB post pack) was because of inventory records. However, if it was
purely because of a bad streaming order, we could probably fix that by
changing how we stream texts.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/inventory-delta into lp:bzr.
> 
> Requested reviews:
>     bzr-core (bzr-core)
> 
> This is a pretty big patch.  It does lots of things:
> 
>  * adds new insert_stream and get_stream verbs
>  * adds de/serialization of inventory-delta records on the network
>  * fixes rich-root generation in StreamSource
>  * adds a bunch of new scenarios to per_interrepository tests
>  * fixes some 'pack already exist' bugs for packing a single GC pack (i.e. when
>    the new pack is already optimal).
>  * improves the inventory_delta module a little
>  * various miscellaneous fixes and new tests that are hopefully self-evident
>  * and, most controversially, removes InterDifferingSerializer.
> 
>>From John's mail a while back there were a bunch of issues with removing IDS.  I
> think the outstanding ones are:
> 
>> 1) Incremental updates. IDS converts batches of 100 revs at a time,
>> which also triggers autopacks at 1k revs. Streaming fetch is currently
>> an all-or-nothing, which isn't appropriate (IMO) for conversions.
>> Consider that conversion can take *days*, it is important to have
>> something that can be stopped and resumed.

Generally, yes.

There is also:

3) Being able to resume because you snapshotted periodically as you
went. This seems even more important for a network transfer.

> The streaming code is pretty similar in how it does the conversion now to the
> way IDS did it, but probably still different enough that we will want to measure
> the impact of this.  I'm definitely concerned about case 2, the lack of packing
> as you go, although perhaps the degree of bloat is reduced by using
> semantic inventory-delta records?
>

> The reason why I eventually deleted IDS was that it was just too burdensome to
> keep two code paths alive, thoroughly tested, and correct.  For instance, if we
> simply reinstated IDS for local-only fetches then most of the test suite,
> including the relevant interrepo tests, will only exercise IDS.  Also, IDS
> turned out to have a bug when used on a stacked repository that the extending
> test suite in this branch revealed (I've forgotten the details, but can dig them
> up if you like).  It didn't seem worth the hassle of fixing IDS when I already
> had a working implementation.
> 
> I'm certainly open to reinstating IDS if it's the most expedient way to have
> reasonable local performance for upgrades, but I thought I'd try to be bold and
> see if we could just live without the extra complexity.  Maybe we can improve
> performance of streaming rather than resurrect IDS?
> 
> -Andrew.

I'm certainly open to the suggestion of getting rid of IDS. I don't like
having multiple code paths. It just happens that there are *big* wins
and it is often easier to write optimized code in a different framework.

(3) is an issue I'd like to see addressed, but which Robert seems
particularly unhappy having us try to do. (See other bug comments, etc
about how other systems don't do it and he feels it isn't worth doing.)

It was fairly straightforward to do with IDS, the argument I think from
Robert is that the client would need to be computing whether it has a
'complete' set and thus can commit the current write group. (the
*source* knows these sort of things, and can just say "and now you have
it", but the client has to re-do all that work to figure it out from a
stream.)

I'll send a follow up email later to do more of a review of the code
changes.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpfT4UACgkQJdeBCYSNAAM0kQCfTO0rlN9Zl1LDues6IpHdp7ju
FHEAoKgP+f81hxKB6o3iyVt5mYPoFyUD
=QSKC
-----END PGP SIGNATURE-----

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-16: Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...
>
> There is also:
>
> 3) Being able to resume because you snapshotted periodically as you
> went. This seems even more important for a network transfer.

and

4) Progress indication

This is really quite useful for a process that can take *days* to
complete. The Stream code is often quite nice, but the fact that it
gives you 2 states:
'getting stream'
'inserting stream'

and nothing more than that is pretty crummy.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpfUK8ACgkQJdeBCYSNAAMa+wCgybpPdd4Yie/Craew/zxX9eF7
cWMAoNcxPftDDdLssboDW7rezk4d2L2d
=WA26
-----END PGP SIGNATURE-----

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-16: Posted in a previous version of this proposal

Download full text (3.2 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So for starters, let me mention what I found wrt performance:

time bzr.dev branch mysql-1k myqsl-2a/1k
real 3m18.490s

time bzr.dev+xml8 branch mysql-1k myqsl-2a/1k
real 2m29.953s

+xml8 is just this patch:
=== modified file 'bzrlib/xml8.py'
- --- bzrlib/xml8.py 2009-07-07 04:32:13 +0000
+++ bzrlib/xml8.py 2009-07-16 16:14:38 +0000
@@ -433,9 +433,9 @@
                 pass
             else:
                 # Only copying directory entries drops us 2.85s => 2.35s
- - # if cached_ie.kind == 'directory':
- - # return cached_ie.copy()
- - # return cached_ie
+ if cached_ie.kind == 'directory':
+ return cached_ie.copy()
+ return cached_ie
                 return cached_ie.copy()

kind = elt.tag

It has 2 basic effects:

1) Avoid copying all inventory entries all the time (so reduce the time
spent in InventoryEntry.copy())

2) By re-using exact objects "_make_delta" can do "x is y" comparisons,
rather than having to do:
x.attribute1 == y.attribute1
and x.attribute2 == y.attribute2
etc.

As you can see it is a big win for this test case (about 4:3 or 33% faster)

So what about Andrew's work:

time bzr.inv.delta branch mysql-1k myqsl-2a/1k
real 10m14.267s

time bzr.inv.delta+xml8 branch mysql-1k myqsl-2a/1k
real 9m49.372s

It also was stuck at:
[##################- ] Fetching revisions:Inserting stream:Walking
content 912/1043

For most of that time, making it really look like it was stalled.

Anyway, this isn't something where it is, say, 10% slower which is
acceptable because we get rid of some extra code paths. This ends up
being 3-4x slower and no longer giving any progress information.

If that scales to launchpad sized projects, you are talking 4-days
becoming 16-days (aka > 2 weeks).

So honestly, I don't think we can land this as is. I won't stick on the
performance side if people feel it is acceptable. But I did spend a lot
of time optimizing IDS that clearly hasn't been done with StreamSource.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpfWm0ACgkQJdeBCYSNAAM8mgCgru3K3SpP8BcMZdLJLH...

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/inventory-delta into lp:bzr.
> 
> Requested reviews:
>     bzr-core (bzr-core)
> 
> This is a pretty big patch.  It does lots of things:
> 
>  * adds new insert_stream and get_stream verbs
>  * adds de/serialization of inventory-delta records on the network
>  * fixes rich-root generation in StreamSource
>  * adds a bunch of new scenarios to per_interrepository tests
>  * fixes some 'pack already exist' bugs for packing a single GC pack (i.e. when
>    the new pack is already optimal).
>  * improves the inventory_delta module a little
>  * various miscellaneous fixes and new tests that are hopefully self-evident
>  * and, most controversially, removes InterDifferingSerializer.
> 
>>From John's mail a while back there were a bunch of issues with removing IDS.  I
> think the outstanding ones are:

So for starters, let me mention what I found wrt performance:

time bzr.dev branch mysql-1k myqsl-2a/1k
  real    3m18.490s

time bzr.dev+xml8 branch mysql-1k myqsl-2a/1k
  real    2m29.953s

+xml8 is just this patch:
=== modified file 'bzrlib/xml8.py'
- --- bzrlib/xml8.py      2009-07-07 04:32:13 +0000
+++ bzrlib/xml8.py      2009-07-16 16:14:38 +0000
@@ -433,9 +433,9 @@
                 pass
             else:
                 # Only copying directory entries drops us 2.85s => 2.35s
- -                # if cached_ie.kind == 'directory':
- -                #     return cached_ie.copy()
- -                # return cached_ie
+                if cached_ie.kind == 'directory':
+                    return cached_ie.copy()
+                return cached_ie
                 return cached_ie.copy()

kind = elt.tag

It has 2 basic effects:

1) Avoid copying all inventory entries all the time (so reduce the time
spent in InventoryEntry.copy())

2) By re-using exact objects "_make_delta" can do "x is y" comparisons,
rather than having to do:
  x.attribute1 == y.attribute1
  and x.attribute2 == y.attribute2
etc.

As you can see it is a big win for this test case (about 4:3 or 33% faster)

So what about Andrew's work:

time bzr.inv.delta branch mysql-1k myqsl-2a/1k
  real    10m14.267s

time bzr.inv.delta+xml8 branch mysql-1k myqsl-2a/1k
  real    9m49.372s

It also was stuck at:
[##################- ] Fetching revisions:Inserting stream:Walking
content 912/1043

For most of that time, making it really look like it was stalled.

If that scales to launchpad sized projects, you are talking 4-days
becoming 16-days (aka > 2 weeks).

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpfWm0ACgkQJdeBCYSNAAM8mgCgru3K3SpP8BcMZdLJLHkqSSTQ
TlAAoLVd7ndCbPeHNl3Kxbu0clMKLnR7
=BB8Q
-----END PGP SIGNATURE-----

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-16: Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John A Meinel wrote:
> Andrew Bennetts wrote:

...

> So for starters, let me mention what I found wrt performance:
>
> time bzr.dev branch mysql-1k myqsl-2a/1k
> real 3m18.490s
>
> time bzr.dev+xml8 branch mysql-1k myqsl-2a/1k
> real 2m29.953s

...

> time bzr.inv.delta branch mysql-1k myqsl-2a/1k
> real 10m14.267s
>
> time bzr.inv.delta+xml8 branch mysql-1k myqsl-2a/1k
> real 9m49.372s

Also, for real-world space issues:
$ du -ksh mysql-2a*/.bzr/repository/obsolete*
1.9M mysql-2a-bzr.dev/.bzr/repository/obsolete_packs
467M mysql-2a-inv-delta/.bzr/repository/obsolete_packs

The peak size (watch du -ksh mysql-2a-bzr.dev) during conversion using
IDS was 49MB.

$ du -ksh mysql-2a*/.bzr/repository/packs*
11M mysql-2a-bzr.dev/.bzr/repository/packs
9.1M mysql-2a-inv-delta/.bzr/repository/packs

So the new code wins slightly in the final size on disk, because it
packed at the end, rather than at 1k revs (and then there were another
40+ revs inserted.)

However, it bloated from 15MB => 467MB while it was doing the transfer
before the final size. Versus a peak of 50MB (almost 10x larger).

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpfdCkACgkQJdeBCYSNAANABACgl4l4L1AjaiXRJgrn5iwLrVe1
tVEAnRRJ0QbWzd8lXFXQXhWdhvqFjnw8
=pXZe
-----END PGP SIGNATURE-----

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-07-17: Posted in a previous version of this proposal

John A Meinel wrote:
[...]
> It also picks out the 'optimal' deltas by computing many different ones
> and finding whichever one was the 'smallest'. For local conversions, the
> time to compute 2-3 deltas was much smaller than to apply an inefficient
> delta.

FWIW, the streaming code also does this. My guess (not yet measured) is that
sending less bytes over the network is also a win, especially when one parent
might be a one-liner and the other might be large merge from trunk.

[...]
> There is also:
>
> 3) Being able to resume because you snapshotted periodically as you
> went. This seems even more important for a network transfer.

Yes, although we already don't have this for the network. It would be great to
have...

[...]
> I'm certainly open to the suggestion of getting rid of IDS. I don't like
> having multiple code paths. It just happens that there are *big* wins
> and it is often easier to write optimized code in a different framework.

Sure. Like I said for me it was just getting to be a large hassle to maintain
both paths in my branch, even though they were increasingly sharing a lot of
code for e.g. rich root generation before I deleted IDS.

I'd like to try see if we can cheaply fix the performance issues you report in
other mails without needing IDS. If we do need IDS for a while longer then
fine, although I think we'll want to restrict it to local source, local target,
non-stacked cases only.

Thanks for the measurements and quick feedback.

-Andrew.

Revision history for this message

Robert Collins (lifeless) wrote on 2009-07-17: Posted in a previous version of this proposal

On Thu, 2009-07-16 at 16:06 +0000, John A Meinel wrote:
>
> (3) is an issue I'd like to see addressed, but which Robert seems
> particularly unhappy having us try to do. (See other bug comments, etc
> about how other systems don't do it and he feels it isn't worth
> doing.)

I'd like to be clear about this. I'd be ecstatic *if* we can do it well
and robustly. However I don't think it is *at all* easy to that. If I'm
wrong - great.

I'm fine with keeping IDS for local fetches. But when networking is
involved IDS is massively slower than the streaming codepath.

> It was fairly straightforward to do with IDS, the argument I think
> from
> Robert is that the client would need to be computing whether it has a
> 'complete' set and thus can commit the current write group. (the
> *source* knows these sort of things, and can just say "and now you
> have
> it", but the client has to re-do all that work to figure it out from a
> stream.)

I think that aspect is simple - we have a stream subtype that says
'checkpoint'. Its the requirement to do all that work that is, I think
problematic - and thats *without* considering stacking, which makes it
hugely harder.

-Rob

Revision history for this message

Robert Collins (lifeless) wrote on 2009-07-17: Posted in a previous version of this proposal

On Thu, 2009-07-16 at 16:12 +0000, John A Meinel wrote:
>
>
> 4) Progress indication
>
> This is really quite useful for a process that can take *days* to
> complete. The Stream code is often quite nice, but the fact that it
> gives you 2 states:
> 'getting stream'
> 'inserting stream'
>
> and nothing more than that is pretty crummy.

That is a separate bug however, and one that affects normal fetches too.
So I don't think tying it to the IDS discussion is necessary or
particularly helpful.

-Rob

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-17: Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Thu, 2009-07-16 at 16:12 +0000, John A Meinel wrote:
>>
>> 4) Progress indication
>>
>> This is really quite useful for a process that can take *days* to
>> complete. The Stream code is often quite nice, but the fact that it
>> gives you 2 states:
>> 'getting stream'
>> 'inserting stream'
>>
>> and nothing more than that is pretty crummy.
>
> That is a separate bug however, and one that affects normal fetches too.
> So I don't think tying it to the IDS discussion is necessary or
> particularly helpful.
>
> -Rob
>

It is explicitly relevant that doing "bzr upgrade --2a" which will take
longer-than-normal would now not even show a progress bar.

For local fetches, you don't even get the "transport activity"
indicator, so it *really* looks hung. It doesn't even write things into
.bzr.log so that you know it is doing anything other than spinning in a
while True loop. I guess you can tell because your disk consumption is
going way up...

I don't honestly know the performance difference for streaming a lot of
content over the network. Given a 4x performance slowdown, for large
fetches IDS could still be faster. I certainly agree that IDS is
probably significantly more inefficient when doing something like "give
me the last 2 revs".

It honestly wasn't something I was optimizing for (cross format
fetching). I *was* trying to make 'bzr upgrade' be measured in hours
rather than days/weeks/etc.

Also, given that you have to upgrade all of your stacked locations at
the same time, and --2a is a trap door, aren't 95% of upgrades going to
be all at once anyway?

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpf46YACgkQJdeBCYSNAANwAwCfYQj7gws3O4KDPxqrcMLu4nfB
554AoIyuns4b5Fsa3wf4uFhf4Uex00oQ
=qjX9
-----END PGP SIGNATURE-----

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-07-22: Posted in a previous version of this proposal

This is the same patch updated for bzr.dev, and with InterDifferingSerializer restored for local fetches only. For non-local fetches I'm pretty sure the streaming code path is massively faster based on the timings I've done (even via TCP HPSS to localhost).

I'm sure we do want to get rid of IDS eventually (and fix the shortcomings in streaming that John has pointed out) but doing that shouldn't block the rest of this work, even if it is a small maintenance headache.

[FWIW, if I don't restrict IDS to local-only branches I get test failures like:

Traceback (most recent call last):
  File "/home/andrew/warthogs/bzr/inventory-delta/bzrlib/tests/per_interrepository/test_fetch.py", line 137, in test_fetch_parent_inventories_at_stacking_boundary_smart_old
    self.test_fetch_parent_inventories_at_stacking_boundary()
  File "/home/andrew/warthogs/bzr/inventory-delta/bzrlib/tests/per_interrepository/test_fetch.py", line 181, in test_fetch_parent_inventories_at_stacking_boundary
    self.assertCanStreamRevision(unstacked_repo, 'merge')
  File "/home/andrew/warthogs/bzr/inventory-delta/bzrlib/tests/per_interrepository/test_fetch.py", line 187, in assertCanStreamRevision
    for substream_kind, substream in source.get_stream(search):
  File "/home/andrew/warthogs/bzr/inventory-delta/bzrlib/remote.py", line 1895, in missing_parents_chain
    for kind, stream in self._get_stream(sources[0], search):
  File "/home/andrew/warthogs/bzr/inventory-delta/bzrlib/smart/repository.py", line 537, in record_stream
    for bytes in byte_stream:
  File "/home/andrew/warthogs/bzr/inventory-delta/bzrlib/smart/message.py", line 338, in read_streamed_body
    _translate_error(self._body_error_args)
  File "/home/andrew/warthogs/bzr/inventory-delta/bzrlib/smart/message.py", line 361, in _translate_error
    raise errors.ErrorFromSmartServer(error_tuple)
ErrorFromSmartServer: Error received from smart server: ('error', "<bzrlib.groupcompress.GroupCompressVersionedFiles object at 0xb06348c> has no revision ('sha1:98fd3a13366960dc27dcb4b6ddb2b55aca3aae7b',)")

(for scenarios like Pack6RichRoot->2a).]

Anyway, please review.

This is the same patch updated for bzr.dev, and with InterDifferingSerializer restored for local fetches only.  For non-local fetches I'm pretty sure the streaming code path is massively faster based on the timings I've done (even via TCP HPSS to localhost).

[FWIW, if I don't restrict IDS to local-only branches I get test failures like:

(for scenarios like Pack6RichRoot->2a).]

Anyway, please review.

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-23: Posted in a previous version of this proposal

I'm not really sure what testing you've done on this, but I'm getting some really strange results.

Specifically, when I push it turns out to be sending xml fragments over the wire. Specifically it seems to be this code:
        elif (not from_format.supports_chks):
            # Source repository doesn't support chks. So we can transmit the
            # inventories 'as-is' and either they are just accepted on the
            # target, or the Sink will properly convert it.
            # (XXX: this assumes that all non-chk formats are understood as-is
            # by any Sink, but that presumably isn't true for foreign repo
            # formats added by bzr-svn etc?)
            return self._get_simple_inventory_stream(revision_ids,
                    missing=missing)

Which means that the raw xml bytes are being transmitted, and then the target side is extracting the xml upcasting and downcasting.

I see that there are code paths in place to do otherwise, but as near as I can tell, "_stream_invs_as_deltas" is only getting called if the *source* format is CHK.

From the profiling I've done, the _generate_root_texts() code is the bulk of the overhead with the new format. But *that* is because all of the data is sent to the server as XML texts, and the server has to do all the work to convert it.

When I just disable the code path that sends 'simple_inventory_stream', then I get:

$ time wbzr push bzr://localhost/test-2a/x
bzr: ERROR: Version present for / in TREE_ROOT

So I have a *strong* feeling the code you've introduced is actually broken, and you just didn't realize it.

review: Needs Fixing

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-28: Posted in a previous version of this proposal

Download full text (7.1 KiB)

So, to further my discussion. While investigating this, I found some odd bits in _stream_invs_as_deltas. For example:

+ def _stream_invs_as_deltas(self, revision_ids, fulltexts=False):
from_repo = self.from_repository

...

+ inventories = self.from_repository.iter_inventories(
+ revision_ids, 'topological')
+ # XXX: ideally these flags would be per-revision, not per-repo (e.g.
+ # streaming a non-rich-root revision out of a rich-root repo back into
+ # a non-rich-root repo ought to be allowed)
+ format = from_repo._format
+ flags = (format.rich_root_data, format.supports_tree_reference)
+ invs_sent_so_far = set([_mod_revision.NULL_REVISION])
+ for inv in inventories:
key = (inv.revision_id,)

...

^- Notice that once you've prepared a delta for "inv.revision_id" you then set "basis_id" as part of "invs_sent_so_far". Which AFAICT means you will sending most of the inventories as complete texts, because you won't actually think you've sent what you actually have done.

+ yield versionedfile.InventoryDeltaContentFactory(
+ key, parents, None, delta, basis_id, flags, from_repo)

The way you've handled the "no parents are available use NULL", means that you
actually create a delta to NULL for *every* revision, and find out if it is the
smallest one. Which seems inefficient versus just using NULL when nothing else
is available. (Note that IDS goes as far as to not use parents that aren't
cached even if they have been sent, and to fall back to using the last-sent
revision otherwise.)

I've changed this loop around a bit, to avoid some duplication and make it a little bit clearer (at least to me) what is going on. I also added a quick LRUCache for the parent inventories that we've just sent, as re-extracting them from the repository is not going to be efficient. (Especially extracting them one at a time.)

Going for that last line... 'key' is a tuple, and 'parents' is a tuple of tuples (or a list of tuples), but 'basis_id' is a simple string.

It seems like we should be consistent at that level. What do you think?

As for "flags", wouldn't it be better to pass that in as a *dict*. You pass it directly to:

serializer.require_flags(*self._format_flags)

And that, IMO, is asking for trouble. Yes it is most likely correct as written, but it means that you have to line up the tuple created at the start of _stream_invs_as_deltas:
flags = (format.rich_root_data, format.supports_tree_reference)

with the *arguments* to a function nested 3 calls away.
So how about:

  flags = {'rich_root_data': format.rich_root_data,
           'supports_tree_reference': format.supports_tree_reference
          }
...

and
serializer.require_flags(**self._format_flags)

It isn't vastly better, but ...

So, to further my discussion. While investigating this, I found some odd bits in _stream_invs_as_deltas. For example:

+    def _stream_invs_as_deltas(self, revision_ids, fulltexts=False):
         from_repo = self.from_repository

...

+        inventories = self.from_repository.iter_inventories(
+            revision_ids, 'topological')
+        # XXX: ideally these flags would be per-revision, not per-repo (e.g.
+        # streaming a non-rich-root revision out of a rich-root repo back into
+        # a non-rich-root repo ought to be allowed)
+        format = from_repo._format
+        flags = (format.rich_root_data, format.supports_tree_reference)
+        invs_sent_so_far = set([_mod_revision.NULL_REVISION])
+        for inv in inventories:
             key = (inv.revision_id,)

...

+                for parent_id in parent_ids:
...
+                    if (best_delta is None or
+                        len(best_delta) > len(candidate_delta)):
+                        best_delta = candidate_delta
+                        basis_id = parent_id
+                delta = best_delta
+            invs_sent_so_far.add(basis_id)

+            yield versionedfile.InventoryDeltaContentFactory(
+                key, parents, None, delta, basis_id, flags, from_repo)

Going for that last line... 'key' is a tuple, and 'parents' is a tuple of tuples (or a list of tuples), but 'basis_id' is a simple string.

It seems like we should be consistent at that level. What do you think?

As for "flags", wouldn't it be better to pass that in as a *dict*. You pass it directly to:

serializer.require_flags(*self._format_flags)

And that, IMO, is asking for trouble. Yes it is most likely correct as written, but it means that you have to line up the tuple created at the start of _stream_invs_as_deltas:
        flags = (format.rich_root_data, format.supports_tree_reference)

with the *arguments* to a function nested 3 calls away.
So how about:

flags = {'rich_root_data': format.rich_root_data,
           'supports_tree_reference': format.supports_tree_reference
          }
...

and
  serializer.require_flags(**self._format_flags)

It isn't vastly better, but at least they are named arguments, rather than
fairly arbitrary positional ones.

In the end, I don't think that it is ideal that an incremental push of 1
revision will transmit a complete inventory (in delta form). I understand the
limitation is that we currently are unable to 'buffer' these records anywhere
in a persistent state between RPC calls. (Even though we have a bytes-on-the
wire representation, we don't have an index/etc to look them up in.)

I suppose one possibility would be to have the client reject the stream
entirely and request it be re-sent with the basis available. The states are:

1) Pushing to a shared repo, everything should be there that isn't being
pushed, so you should send minimal deltas.

2) Pushing to a newly formed stacked repo. Likely things will be available in
the fallback repo, but we've chosen to *not* give remote smart repositories
access to the fallback.

3) Incremental push to a stacked repository. Will likely already have all the
parent inventories, so we don't need to push them yet again.

Unfortunately we are pesimising 1 & 3 just for the sake of 2. I'd like to see
logic of the form:
  a) These are the revisions I'm going to transmit, assuming other clients will
  have access to this fallback repo.
  b) But the server won't have access to this subset, so this is the
  *accessible* subset which I'll use to determine the bytes-on-the-wire.

Note that this also starts to handle having clients push up parent inventory
texts opportunistically, rather than waiting for the remote side to say "I
don't actually have this parent's inventory, because I'm stacked and you hid my
fallback from me."

Of course, the get_stream_for_missing_keys code is wired up such that clients
ignore the response about missing inventories if the target is a stacked repo,
and just always force up an extra fulltext copy of all parent inventories
anyway. (Because of the bugs in 1.13, IIRC.)

It doesn't have to happen today, but the code as written means that changing a
single file, committing, and pushing that up to a stacked branch is going to
upload at least 2 "complete/full" inventory deltas. (The rev you are pushing,
and its parent.)

For *me* the ssh handshake is probably signficantly more than the time to push
up 2 inventories of bzr.dev, but that tradeoff is very different for people
working over a tiny bandwidth from their GPRS phone, trying to push a critical
bugfix for a Launchpad branch.

Again, the code as written fails to actually transmit data over the smart protocol, I get:
$ time awbzr branch bzrtools_1.9/ -r -100 bzr://localhost/test-2a/b
bzr: ERROR: Version present for / in TREE_ROOT

even with the fixes I've mentioned so far.

Also, I tried to see what the serialized form over the wire was, and I think the code was:

lines.sort()
        lines[0] = "format: %s\n" % InventoryDeltaSerializer.FORMAT_1
        lines[1] = "parent: %s\n" % old_name
        lines[2] = "version: %s\n" % new_name
        lines[3] = "versioned_root: %s\n" % self._serialize_bool(
            self._versioned_root)
        lines[4] = "tree_references: %s\n" % self._serialize_bool(
            self._tree_references)

^- However, if I traced it correctly, old_name == 'basis_id' which is a string, but new_name == key, aka a tuple.

So we're clearly doing something wrong here.

Digging further, the bytes on the wire don't include things like:
 InventoryDeltaContentFactory.parents (list/tuple of tuple keys)
 InventoryDeltaContentFactory.sha1

Which means that I don't see any way to get perfect mapping out of the deserialized form. (I suppose you can then go to the revision texts to find the parent mapping, but that really isn't nice to do.)

Anyway, it means that what you get out the other side of translating into and then back out of the serialized form is not exactly equal to what went in, which seems very bad (and likely to *not* break when using local actions, and then have it break when using remote ones.)

So I've made some updates into lp:~jameinel/bzr/1.18-inventory-delta

but ultimately this isn't ready to land.

review: Needs Fixing

Revision history for this message

Robert Collins (lifeless) wrote on 2009-07-28: Posted in a previous version of this proposal

On Tue, 2009-07-28 at 21:51 +0000, John A Meinel wrote:
>
> I suppose one possibility would be to have the client reject the
> stream entirely and request it be re-sent with the basis available.

If we do this then... The server just asks for that inventory again. The
problem though, is that *all* inventories will need to be asked for
again, which could be pathologically bad. It is possible to write a file
and restore it later without network API changes, just pass in write
group save tokens. I don't think we need to block on that for this patch
though - sending a full inventory is no worse than sending the full text
of a file, which 2a does as well - and ChangeLog files etc are very big
- as large or larger than an inventory, I would expect.

> Of course, the get_stream_for_missing_keys code is wired up such that
> clients
> ignore the response about missing inventories if the target is a
> stacked repo,
> and just always force up an extra fulltext copy of all parent
> inventories
> anyway. (Because of the bugs in 1.13, IIRC.)

The condition is wrong... it probably should say 'if we're not told
anything is missing..' - and it should be guarded such that the new push
verb (introduced recently?) doesn' trigger this paranoia.

> Anyway, it means that what you get out the other side of translating
> into and then back out of the serialized form is not exactly equal to
> what went in, which seems very bad (and likely to *not* break when
> using local actions, and then have it break when using remote ones.)

This is a reason to serialise/deserialise locally ;) - at least to start
with.

-Rob

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-07-29: Posted in a previous version of this proposal

Robert Collins wrote:
> On Tue, 2009-07-28 at 21:51 +0000, John A Meinel wrote:
> >
> > I suppose one possibility would be to have the client reject the
> > stream entirely and request it be re-sent with the basis available.
>
> If we do this then... The server just asks for that inventory again. The
> problem though, is that *all* inventories will need to be asked for
> again, which could be pathologically bad. It is possible to write a file
> and restore it later without network API changes, just pass in write
> group save tokens. I don't think we need to block on that for this patch
> though - sending a full inventory is no worse than sending the full text
> of a file, which 2a does as well - and ChangeLog files etc are very big
> - as large or larger than an inventory, I would expect.

Also, we currently only allow one “get missing keys” stream after the original
stream. So if we optimistically send just a delta then having the client
reject it means that the next time we have to send not only a full delta closure
the second time, we also need to send all the inventory parents, because there's
no further opportunity for the client to ask for more keys. This is potentially
even worse than just sending a single fulltext.

I think sending a fulltext cross-format is fine. We aren't trying to maximally
optimise cross-format fetches, just make them acceptable. We should file a bug
for improving this, but I don't think it's particularly urgent.

FWIW, I think the correct way to fix this is to allow the receiving repository
to somehow store inventory-deltas that it can't add yet, so that it can work as
part of the regular suspend_write_group and get missing keys logic. Then the
sender can optimistically send a delta, and the receiver can either insert it or
store it in a temporary upload pack and ask for the parent keys, just as we
already do for all other types of deltas. This is optimal in the case where the
recipient already has the basis, and only requires one extra roundtrip in the
case that it doesn't. We can perhaps do this by creating an upload pack just to
store the fulltexts of inventory-deltas, including that pack in the resume
tokens, but making sure never to insert in the final commit.

-Andrew.

Robert Collins wrote:
> On Tue, 2009-07-28 at 21:51 +0000, John A Meinel wrote:
> > 
> > I suppose one possibility would be to have the client reject the
> > stream entirely and request it be re-sent with the basis available.
> 
> If we do this then... The server just asks for that inventory again. The
> problem though, is that *all* inventories will need to be asked for
> again, which could be pathologically bad. It is possible to write a file
> and restore it later without network API changes, just pass in write
> group save tokens. I don't think we need to block on that for this patch
> though - sending a full inventory is no worse than sending the full text
> of a file, which 2a does as well - and ChangeLog files etc are very big
> - as large or larger than an inventory, I would expect.

Also, we currently only allow one “get missing keys” stream after the original
stream.  So if we  optimistically send just a delta then having the client
reject it means that the next time we have to send not only a full delta closure
the second time, we also need to send all the inventory parents, because there's
no further opportunity for the client to ask for more keys.  This is potentially
even worse than just sending a single fulltext.

I think sending a fulltext cross-format is fine.  We aren't trying to maximally
optimise cross-format fetches, just make them acceptable.  We should file a bug
for improving this, but I don't think it's particularly urgent.

FWIW, I think the correct way to fix this is to allow the receiving repository
to somehow store inventory-deltas that it can't add yet, so that it can work as
part of the regular suspend_write_group and get missing keys logic.  Then the
sender can optimistically send a delta, and the receiver can either insert it or
store it in a temporary upload pack and ask for the parent keys, just as we
already do for all other types of deltas.  This is optimal in the case where the
recipient already has the basis, and only requires one extra roundtrip in the
case that it doesn't.  We can perhaps do this by creating an upload pack just to
store the fulltexts of inventory-deltas, including that pack in the resume
tokens, but making sure never to insert in the final commit.
 
-Andrew.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-03: Posted in a previous version of this proposal

Download full text (5.1 KiB)

John A Meinel wrote:
> Review: Needs Fixing
> So, to further my discussion. While investigating this, I found some odd bits
> in _stream_invs_as_deltas. For example:
[...]
>
> ^- Notice that once you've prepared a delta for "inv.revision_id" you then set
> "basis_id" as part of "invs_sent_so_far". Which AFAICT means you will sending
> most of the inventories as complete texts, because you won't actually think
> you've sent what you actually have done.

Oops, right. I've merged your fix for that.

> + yield versionedfile.InventoryDeltaContentFactory(
> + key, parents, None, delta, basis_id, flags, from_repo)
>
> The way you've handled the "no parents are available use NULL", means that you
> actually create a delta to NULL for *every* revision, and find out if it is the
> smallest one. Which seems inefficient versus just using NULL when nothing else
> is available. (Note that IDS goes as far as to not use parents that aren't
> cached even if they have been sent, and to fall back to using the last-sent
> revision otherwise.)
>
> I've changed this loop around a bit, to avoid some duplication and make it a
> little bit clearer (at least to me) what is going on. I also added a quick
> LRUCache for the parent inventories that we've just sent, as re-extracting
> them from the repository is not going to be efficient. (Especially extracting
> them one at a time.)

Thanks, I've merged your fix.

> Going for that last line... 'key' is a tuple, and 'parents' is a tuple of tuples (or a list of tuples), but 'basis_id' is a simple string.
>
> It seems like we should be consistent at that level. What do you think?

It would be nice, but not vital.

> As for "flags", wouldn't it be better to pass that in as a *dict*. You pass it directly to:
[...]
> serializer.require_flags(**self._format_flags)
>
> It isn't vastly better, but at least they are named arguments, rather than
> fairly arbitrary positional ones.

Ok, I'll do that.

> In the end, I don't think that it is ideal that an incremental push of 1
> revision will transmit a complete inventory (in delta form). I understand the
> limitation is that we currently are unable to 'buffer' these records anywhere
> in a persistent state between RPC calls. (Even though we have a bytes-on-the
> wire representation, we don't have an index/etc to look them up in.)

(As said elsewhere on this thread...)

This is only an issue for a cross-format push, which doesn't have to be
maximally efficient. It just has to be reasonable. We can look at doing better
later if we want, but this is already a massive improvement in terms of data
transferred compared to IDS.

[...]
> For *me* the ssh handshake is probably signficantly more than the time to push
> up 2 inventories of bzr.dev, but that tradeoff is very different for people
> working over a tiny bandwidth from their GPRS phone, trying to push a critical
> bugfix for a Launchpad branch.

Then they probably should go to the effort of having their local branch in a
matching format :)

We're not intending to use inventory-deltas for anything but cross-format
fetches AFAIK.

> Again, the code as written fails to actually transmit data over the...

John A Meinel wrote:
> Review: Needs Fixing
> So, to further my discussion. While investigating this, I found some odd bits
> in _stream_invs_as_deltas. For example:
[...]
> 
> ^- Notice that once you've prepared a delta for "inv.revision_id" you then set
> "basis_id" as part of "invs_sent_so_far". Which AFAICT means you will sending
> most of the inventories as complete texts, because you won't actually think
> you've sent what you actually have done.

Oops, right.  I've merged your fix for that.

> +            yield versionedfile.InventoryDeltaContentFactory(
> +                key, parents, None, delta, basis_id, flags, from_repo)
> 
> The way you've handled the "no parents are available use NULL", means that you
> actually create a delta to NULL for *every* revision, and find out if it is the
> smallest one. Which seems inefficient versus just using NULL when nothing else
> is available. (Note that IDS goes as far as to not use parents that aren't
> cached even if they have been sent, and to fall back to using the last-sent
> revision otherwise.)
> 
> I've changed this loop around a bit, to avoid some duplication and make it a
> little bit clearer (at least to me) what is going on. I also added a quick
> LRUCache for the parent inventories that we've just sent, as re-extracting
> them from the repository is not going to be efficient. (Especially extracting
> them one at a time.)

Thanks, I've merged your fix.

> Going for that last line... 'key' is a tuple, and 'parents' is a tuple of tuples (or a list of tuples), but 'basis_id' is a simple string.
> 
> It seems like we should be consistent at that level. What do you think?

It would be nice, but not vital.

> As for "flags", wouldn't it be better to pass that in as a *dict*. You pass it directly to:
[...]
>   serializer.require_flags(**self._format_flags)
> 
> It isn't vastly better, but at least they are named arguments, rather than
> fairly arbitrary positional ones.

Ok, I'll do that.

(As said elsewhere on this thread...)

This is only an issue for a cross-format push, which doesn't have to be
maximally efficient.  It just has to be reasonable.  We can look at doing better
later if we want, but this is already a massive improvement in terms of data
transferred compared to IDS.

Then they probably should go to the effort of having their local branch in a
matching format :)

We're not intending to use inventory-deltas for anything but cross-format
fetches AFAIK.

> Again, the code as written fails to actually transmit data over the smart protocol, I get:
> $ time awbzr branch bzrtools_1.9/ -r -100 bzr://localhost/test-2a/b
> bzr: ERROR: Version present for / in TREE_ROOT
> 
> even with the fixes I've mentioned so far.

I've found and fixed this.  It turned out to be a bug in inventory_delta.py (it
couldn't round-trip deltas from non-rich-root inventories).

[...]
> ^- However, if I traced it correctly, old_name == 'basis_id' which is a string, but new_name == key, aka a tuple.
> 
> So we're clearly doing something wrong here.

Yes.  I found and fixed this too.  I've also made delta_to_lines raise TypeError
if it gets non-strs for old_name or new_name, because that's always going to be
a mistake.  The current behaviour (which will silently str a tuple) isn't very
helpful.

> Digging further, the bytes on the wire don't include things like:
>  InventoryDeltaContentFactory.parents (list/tuple of tuple keys)
>  InventoryDeltaContentFactory.sha1

Inventories don't have parent lists, IIRC?  Revisions do, and text versions do,
but inventories don't.  (They may have a compression parent, but not a semantic
parent.)

The sha1 is never set, although perhaps it should be.

These two attributes are really just there to satisfy the ContentFactory
interface.  After a separate discussion with Robert I'm goint get rid of
InventoryDeltaContentFactory and instead make an explicit inventory-delta
substream that will simply carry FulltextContentFactories of serialised deltas.
(I'm not thrilled at an unnecessary serialise/deseralise pass this will incur in
process, but it doesn't seem to be the bottleneck at the moment.)

[...]
> So I've made some updates into lp:~jameinel/bzr/1.18-inventory-delta

Thanks, merged.  (Including, for now, the disabling of
InterDifferingSerializer... FWIW, I was doing that by commenting out the
registration call rather than hacking is_compatible :P)

I'll send an updated merge proposal once I've done the change to use a new
substream rather than add a content factory that isn't really a factory for
bytes in a versionedfile.

-Andrew.

Revision history for this message

Robert Collins (lifeless) wrote on 2009-08-03: Posted in a previous version of this proposal

On Mon, 2009-08-03 at 06:33 +0000, Andrew Bennetts wrote:
>
> > Digging further, the bytes on the wire don't include things like:
> > InventoryDeltaContentFactory.parents (list/tuple of tuple keys)
> > InventoryDeltaContentFactory.sha1
>
> Inventories don't have parent lists, IIRC? Revisions do, and text
> versions do,
> but inventories don't. (They may have a compression parent, but not a
> semantic
> parent.)

Indeed. I wouldn't give them one in the deltas; let repositories that
care calculate one.
parents=None is valid for the interface..

> The sha1 is never set, although perhaps it should be.

No it shouldn't. the sha1 is format specific, and we don't convert back
to the original format to check it, so it would, at best, be discarded.
sha1 = None is valid for the interface as well

_Rob

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-03: Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Mon, 2009-08-03 at 06:33 +0000, Andrew Bennetts wrote:
>>> Digging further, the bytes on the wire don't include things like:
>>> InventoryDeltaContentFactory.parents (list/tuple of tuple keys)
>>> InventoryDeltaContentFactory.sha1
>> Inventories don't have parent lists, IIRC? Revisions do, and text
>> versions do,
>> but inventories don't. (They may have a compression parent, but not a
>> semantic
>> parent.)
>
> Indeed. I wouldn't give them one in the deltas; let repositories that
> care calculate one.
> parents=None is valid for the interface..

So... all of our inventory texts have parents in all repository formats
that exist *today*.

It isn't feasible to have them query the revisions. Because in "knit"
streams, they haven't received the revision data until after the
inventory data.

At first I thought we had fixed inventories in CHK, but I checked, and
it still has parents. And there is other code that assumes this fact.

Case in point, my new bundle sending/receiving code. It intentionally
queries the inventories.get_parent_map because
revisions.get_parent_map() has *not* been filled in yet.
(Bundle.insert_revisions always inserts objects as texts, inventories,
revisions.)

John
=:->

>
>> The sha1 is never set, although perhaps it should be.
>
> No it shouldn't. the sha1 is format specific, and we don't convert back
> to the original format to check it, so it would, at best, be discarded.
> sha1 = None is valid for the interface as well
>
> _Rob

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkp25csACgkQJdeBCYSNAANx4ACeJKMpECIH1YR6yJ5+RxNCNgIW
jGAAn2eOWm6DNJNWweSo4NF5CZ34OJiC
=Hyh5
-----END PGP SIGNATURE-----

Revision history for this message

Robert Collins (lifeless) wrote on 2009-08-05: Posted in a previous version of this proposal

On Mon, 2009-08-03 at 13:30 +0000, John A Meinel wrote:
>
> > Indeed. I wouldn't give them one in the deltas; let repositories
> that
> > care calculate one.
> > parents=None is valid for the interface..
>
> So... all of our inventory texts have parents in all repository
> formats
> that exist *today*.
>
> It isn't feasible to have them query the revisions. Because in "knit"
> streams, they haven't received the revision data until after the
> inventory data.

ugh. ugh ugh ugh swear swear swear.

> At first I thought we had fixed inventories in CHK, but I checked, and
> it still has parents. And there is other code that assumes this fact.
>
> Case in point, my new bundle sending/receiving code. It intentionally
> queries the inventories.get_parent_map because
> revisions.get_parent_map() has *not* been filled in yet.
> (Bundle.insert_revisions always inserts objects as texts, inventories,
> revisions.)

Why does it need to do that though?

We're going to have to break this chain sometime.

If we can't at this stage, we'll need to either supply parents on the
serialised deltas, or add a parents field to the inventory serialisation
form. I prefer the former, myself.

-Rob

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-05:

Ok, third time lucky! :)

Some notable changes since the last review:

  - Uses an inventory-deltas substream rather than inventing a new content factory;
  - Adds a few debug flags to control whether this code path is used or not;
  - Fixes some bugs relating to rich roots and to deletes in inventory_delta.py!

I was surprised to realise that, despite the expectations of inventory_delta.py, non-rich-root repos can have roots with IDs other than 'TREE_ROOT'. What they can't have is a root entry with a revision that doesn't match the inventory's revision.

This beats InterDifferingSerializer for a push of bzr.dev -r2000 from 1.9->2a over localhost HPSS by about 3x (9.5min vs. ~30min, although not on a totally quiescent laptop). LocalTransport push with IDS is about 10x faster than that, though.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-05: Posted in a previous version of this proposal

Robert Collins wrote:
[...]
> If we can't at this stage, we'll need to either supply parents on the
> serialised deltas, or add a parents field to the inventory serialisation
> form. I prefer the former, myself.

FWIW, my current code sends the parents for the inventory in the .parents of the
FulltextContentFactory holding the serialised inventory-delta (and uses that
when calling add_inventory_by_delta).

So my patch maintains the status quo here. (I'm actually fairly sure that the
version of the code John reviewed which inspired this discussion was doing this
too, but that's irrelevant at this point.)

-Andrew.

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-05:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Ok, third time lucky! :)
>
> Some notable changes since the last review:
>
> - Uses an inventory-deltas substream rather than inventing a new content factory;
> - Adds a few debug flags to control whether this code path is used or not;
> - Fixes some bugs relating to rich roots and to deletes in inventory_delta.py!
>
> I was surprised to realise that, despite the expectations of inventory_delta.py, non-rich-root repos can have roots with IDs other than 'TREE_ROOT'. What they can't have is a root entry with a revision that doesn't match the inventory's revision.
>
> This beats InterDifferingSerializer for a push of bzr.dev -r2000 from 1.9->2a over localhost HPSS by about 3x (9.5min vs. ~30min, although not on a totally quiescent laptop). LocalTransport push with IDS is about 10x faster than that, though.

So do I understand correctly that:

IDS => bzr:// ~30min
InventoryDelta => bzr:// 9.5min
IDS => file:// < 1min

So when I was looking at InventoryDelta (before we fixed it to actually
send deltas :) the #1 overhead was actually in "_generate_root_texts()"
because that was iterating over revision_trees and having to extract all
of the inventories yet again.

Anyway I'll give the new code a look over. Unfortunately there are still
a lot of conflated factors, like code that wants to transmit all of the
"texts" before we transmit *any* "inventories" content, which means
somewhere you need to do buffering.

IDS "works" by batching 100 at a time, so it only buffers the 100 or so
inventory deltas before it writes the root texts to the target repo.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkp5gmoACgkQJdeBCYSNAANXLgCgmIFEKd61nvF/69U6vcpgspWe
tIQAoJYnfIbccZmgvWKAL7cwFNxz6H+e
=4Hyp
-----END PGP SIGNATURE-----

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-05:

John Arbash Meinel wrote:
[...]
> So do I understand correctly that:
>
> IDS => bzr:// ~30min
> InventoryDelta => bzr:// 9.5min
> IDS => file:// < 1min

Yes, that's right.

> So when I was looking at InventoryDelta (before we fixed it to actually
> send deltas :) the #1 overhead was actually in "_generate_root_texts()"
> because that was iterating over revision_trees and having to extract all
> of the inventories yet again.

Right, and that's probably still the bottleneck. But even with that bottleneck
it's much faster over the network (even a very fast "network" like the loopback
interface), so I think it's worth merging.

> Anyway I'll give the new code a look over. Unfortunately there are still
> a lot of conflated factors, like code that wants to transmit all of the
> "texts" before we transmit *any* "inventories" content, which means
> somewhere you need to do buffering.
>
> IDS "works" by batching 100 at a time, so it only buffers the 100 or so
> inventory deltas before it writes the root texts to the target repo.

Yeah. It might be nice to somehow arrange similar batching when sending streams
over the network. If we can arrange to make these streams self-contained it
would make it easier to do incremental packing too.

Actually... all we need for incremental packing (which would fix the
"inventory-delta push to 2a is very bloated on disk until the stream is done")
is a way to be able to force a repack of an uncommitted pack (i.e. in upload/,
not inserted). That's probably not too hard to add, and then the StreamSink can
trigger that every N records or when the pack reaches N bytes or something.
I'll have a play with this.

-Andrew.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-05:

John A Meinel wrote:
[...]
> So do I understand correctly that:
>
> IDS => bzr:// ~30min
> InventoryDelta => bzr:// 9.5min
> IDS => file:// < 1min

Also:

InventoryDelta => file:// ~9.5min

(Timings are intentionally a bit approximate, I haven't kept my laptop perfectly
idle while running these tests, etc.)

-Andrew.

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-05:

> John A Meinel wrote:
> [...]
> > So do I understand correctly that:
> >
> > IDS => bzr:// ~30min
> > InventoryDelta => bzr:// 9.5min
> > IDS => file:// < 1min
>
> Also:
>
> InventoryDelta => file:// ~9.5min
>
> (Timings are intentionally a bit approximate, I haven't kept my laptop
> perfectly
> idle while running these tests, etc.)
>
> -Andrew.

So something isn't quite right with my timings as:

wbzr init-repo --2a test-2a
time wbzr push -d ../bzr/bzr.dev -r 2000 test-2a/bzr -DIDS:always
11m12.889s

I wonder if you didn't make a mistake in your timing of IDS.

In my timing of IDS versus InventoryDelta for bzrtools, it was more:

15.8s time wbzr push -d bzrtools bzr://localhost/test-2a/bzrt
19.1s time wbzr push -d bzrtools test-2a/bzrt

Which shows that IDS was actually *slower* than pushing using InventoryDelta over the local loopback.

Given the numbers you quote, 1m is *much* closer to just the simple:
bzr init-repo --1.9 test-19
bzr branch ../bzr/bzr.dev test-19/bzr

Which would be the simple non-converting time.

I'll be running a couple more tests to see if the new refactoring of IDS that you've done has made anything slower, but at least at a first glance the only thing I could find that would be better with IDS is that it doesn't have a second pass over all inventories in order to generate the root texts keys. And that certainly wouldn't explain 9.5m => 1.0m.

I suggest you run your timing test again, and make sure you've set everything up correctly.

I at least thought my laptop was faster than yours, though I'm on Windows and you may have upgraded your laptop since then.

$ time wbzr push -d ../bzr/bzr.dev -r 2000 bzr://localhost/test-2a/bzr -DIDS:never
real 4m32.578s

This is 4m32s down from 11m12s for IDS (file:// to file://). Maybe something did get broken. I'll be running some more tests.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-06:

John A Meinel wrote:
[...]
> So something isn't quite right with my timings as:
>
> wbzr init-repo --2a test-2a
> time wbzr push -d ../bzr/bzr.dev -r 2000 test-2a/bzr -DIDS:always
> 11m12.889s
>
> I wonder if you didn't make a mistake in your timing of IDS.

Yeah, that's likely, I think I probably forgot to init --2a on that run. (I
think I reran it with the correct setup then copy-&-pasted the number from the
wrong one). The other numbers should be fine.

My corrected time for that is 23m 1s! bzr.dev's IDS is much slower for me (29.5
minutes), so I certainly haven't regressed the performance. And the test suite
says I haven't regressed the correctness.

So my patch is looking better and better...

> In my timing of IDS versus InventoryDelta for bzrtools, it was more:
>
> 15.8s time wbzr push -d bzrtools bzr://localhost/test-2a/bzrt
> 19.1s time wbzr push -d bzrtools test-2a/bzrt
>
> Which shows that IDS was actually *slower* than pushing using InventoryDelta
> over the local loopback.

Right, I'm seeing that too with bzr -r2000. (And glancing at the log, I think
it's still slower even if you don't count the time IDS spends autopacking.)

[...]
> I'll be running a couple more tests to see if the new refactoring of IDS that
> you've done has made anything slower, but at least at a first glance the only
> thing I could find that would be better with IDS is that it doesn't have a
> second pass over all inventories in order to generate the root texts keys. And
> that certainly wouldn't explain 9.5m => 1.0m.

I've taken another look at my refactoring of IDS, and I don't see any obvious
problems with my refactoring.

> I suggest you run your timing test again, and make sure you've set everything
> up correctly.
>
> I at least thought my laptop was faster than yours, though I'm on Windows and
> you may have upgraded your laptop since then.

I haven't upgraded my laptop for, well, years :)

> $ time wbzr push -d ../bzr/bzr.dev -r 2000 bzr://localhost/test-2a/bzr -DIDS:never
> real 4m32.578s
>
> This is 4m32s down from 11m12s for IDS (file:// to file://). Maybe something
> did get broken. I'll be running some more tests.

Where did you get to with your tests?

-Andrew.

John A Meinel wrote:
[...]
> So something isn't quite right with my timings as:
> 
> wbzr init-repo --2a test-2a
> time wbzr push -d ../bzr/bzr.dev -r 2000 test-2a/bzr -DIDS:always
> 11m12.889s
> 
> I wonder if you didn't make a mistake in your timing of IDS.

Yeah, that's likely, I think I probably forgot to init --2a on that run.  (I
think I reran it with the correct setup then copy-&-pasted the number from the
wrong one).  The other numbers should be fine.

My corrected time for that is 23m 1s!  bzr.dev's IDS is much slower for me (29.5
minutes), so I certainly haven't regressed the performance.  And the test suite
says I haven't regressed the correctness.

So my patch is looking better and better...

> In my timing of IDS versus InventoryDelta for bzrtools, it was more:
> 
> 15.8s  time wbzr push -d bzrtools bzr://localhost/test-2a/bzrt
> 19.1s  time wbzr push -d bzrtools test-2a/bzrt
> 
> Which shows that IDS was actually *slower* than pushing using InventoryDelta
> over the local loopback.

Right, I'm seeing that too with bzr -r2000.  (And glancing at the log, I think
it's still slower even if you don't count the time IDS spends autopacking.)

I've taken another look at my refactoring of IDS, and I don't see any obvious
problems with my refactoring.

> I suggest you run your timing test again, and make sure you've set everything
> up correctly.
> 
> I at least thought my laptop was faster than yours, though I'm on Windows and
> you may have upgraded your laptop since then.

I haven't upgraded my laptop for, well, years :)

> $ time wbzr push -d ../bzr/bzr.dev -r 2000 bzr://localhost/test-2a/bzr -DIDS:never
> real    4m32.578s
> 
> This is 4m32s down from 11m12s for IDS (file:// to file://). Maybe something
> did get broken. I'll be running some more tests.

Where did you get to with your tests?

-Andrew.

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-06:

Download full text (3.2 KiB)

+* InterDifferingSerializer has been removed. The transformations it
+ provided are now done automatically by StreamSource. (Andrew Bennetts)
+

^- So for starters this isn't true for now at least.

You added the debug flags "-DIDS:never" and "-DIDS:always". I wonder if they wouldn't be better as IDS_never and IDS_always, just because that would seem to fit better with how we have generally named them. Not a big deal.

+
+
+def _new_root_data_stream(
+ root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):

^- I find the wrapping to be a bit strange, since this and the next function don't have any parameters on the first line. More importantly, though, these should probably at least have a minimal docstring to help understand what is going on.

It is nice that you were able to factor out these helpers so that we can consistently generate the root keys and their parents. It is a shame that _new_root_data_stream is being called on:
rev_id_to_root_id = self._find_root_ids(revs, parent_map, graph)

Which is implemented by iterating all revision trees, which requires parsing all of the revisions. Though it then immediately goes on to use those trees to generate the inventory deltas.

    def _find_root_ids(self, revs, parent_map, graph):
        revision_root = {}
        for tree in self.iter_rev_trees(revs):
            revision_id = tree.inventory.root.revision
            root_id = tree.get_root_id()
            revision_root[revision_id] = root_id
        # Find out which parents we don't already know root ids for
        parents = set()
        for revision_parents in parent_map.itervalues():
            parents.update(revision_parents)
        parents.difference_update(revision_root.keys() + [NULL_REVISION])

^- We've certainly had this pattern enough that we really should consider factoring it out into a helper. But it seems the preferred form is:

parents = set()
map(parents.update, parent_map.itervalues())
parents.difference_update(revision_root)
parents.discard(NULL_REVISION)

That should perform better than what we have. (No need to generate a list from keys() nor append to it NULL_REVISION, creating potentially a 3rd all-keys list.)

        # Limit to revisions present in the versionedfile
        parents = graph.get_parent_map(parents).keys()
        for tree in self.iter_rev_trees(parents):
            root_id = tree.get_root_id()
            revision_root[tree.get_revision_id()] = root_id
        return revision_root

I've been doing some more performance testing, and I've basically seen stuff
like:

bzr branch mysql-525 local -DIDS:always
3m15s

bzr branch mysql-525 local -DIDS:always --XML8-nocopy
2m30s

bzr branch mysql-525 local -DIDS:never
4m01s

bzr branch mysql-525 bzr://localhost/remote -DIDS:never
3m25s

So it is slower, but only about 25% slower (except for the extra-optimized
XML8-nocopy).

I think the new code converts almost as fast, and the extra repacking from IDS
actually costs it a bit of time (other than saving it disk space). The only
major deficit at this point is that there is no progress indication. So I'd
probably stick with IDS for local operations just because of that.
...

+* InterDifferingSerializer has been removed.  The transformations it
+  provided are now done automatically by StreamSource.  (Andrew Bennetts)
+

^- So for starters this isn't true for now at least.

+
+
+def _new_root_data_stream(
+    root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):

It is nice that you were able to factor out these helpers so that we can consistently generate the root keys and their parents. It is a shame that _new_root_data_stream is being called on:
        rev_id_to_root_id = self._find_root_ids(revs, parent_map, graph)

Which is implemented by iterating all revision trees, which requires parsing all of the revisions. Though it then immediately goes on to use those trees to generate the inventory deltas.

^- We've certainly had this pattern enough that we really should consider factoring it out into a helper. But it seems the preferred form is:

parents = set()
map(parents.update, parent_map.itervalues())
parents.difference_update(revision_root)
parents.discard(NULL_REVISION)

That should perform better than what we have. (No need to generate a list from keys() nor append to it NULL_REVISION, creating potentially a 3rd all-keys list.)

I've been doing some more performance testing, and I've basically seen stuff
like:

bzr branch mysql-525 local -DIDS:always
3m15s

bzr branch mysql-525 local -DIDS:always --XML8-nocopy
2m30s

bzr branch mysql-525 local -DIDS:never
4m01s

bzr branch mysql-525 bzr://localhost/remote -DIDS:never
3m25s

So it is slower, but only about 25% slower (except for the extra-optimized
XML8-nocopy).

I'll try to finish reviewing the rest, but at least my initial performance
concerns have been addressed.

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-06:

Download full text (10.9 KiB)

Overall, this is looking pretty good. A few small tweaks here and there, and possible concerns. But only one major concern.

You changed the default return value of "iter_inventories()" to return 'unordered' results, which means that "revision_trees()" also not returns 'unordered' results. Which is fairly serious, and I was able to find more than one code path that was relying on the ordering of revision_trees(). So I think the idea is sound, but it needs to be exposed in a backwards compatible manner by making it a flag that defaults to "as-requested" ordering, and then we fix up code paths as we can.

I don't specifically need to review that change, though. And I don't really want to review this patch yet again :). So I'm voting tweak, and you can submit if you agree with my findings.

119 + from bzrlib.graph import FrozenHeadsCache
120 + graph = FrozenHeadsCache(graph)
121 + new_roots_stream = _new_root_data_stream(
122 + root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
123 + return [('texts', new_roots_stream)]

^- I'm pretty sure if we are using FrozenHeadsCache that we really want to use KnownGraph instead. We have a dict 'parent_map' which means we can do much more efficient heads() checks since the whole graph is already loaded. This is a minor thing we can do later, but it would probably be good not to forget.

...

392 - elif last_modified[-1] == ':':
393 - raise errors.BzrError('special revisionid found: %r' % line)
394 - if not delta_tree_references and content.startswith('tree\x00'):
395 + elif newpath_utf8 != 'None' and last_modified[-1] == ':':
396 + # Deletes have a last_modified of null:, but otherwise special
397 + # revision ids should not occur.
398 + raise errors.BzrError('special revisionid found: %r' % line)
399 + if delta_tree_references is False and content.startswith('tree\x00'):

^- What does "newpath_utf8 != 'None'" mean if someone does:

touch None
bzr add None
bzr commit -m "adding None"

Is this a serialization bug waiting to be exposed?

I guess not as it seems the paths are always prefixed with "/", right?

(So a path of None would actually be "newpath_utf8 == '/None'")

However, further down (sorry about the bad indenting):

413 if newpath_utf8 == 'None':
414 newpath = None

^- here you set "newpath=None" but you *don't* set "newpath_utf8" to None.

415 + elif newpath_utf8[:1] != '/':
416 + raise errors.BzrError(
417 + "newpath invalid (does not start with /): %r"
418 + % (newpath_utf8,))
419 else:
420 + # Trim leading slash
421 + newpath_utf8 = newpath_utf8[1:]
422 newpath = newpath_utf8.decode('utf8')
423 + content_tuple = tuple(content.split('\x00'))
424 + if content_tuple[0] == 'deleted':
425 + entry = None
426 + else:
427 + entry = _parse_entry(
428 + newpath_utf8, file_id, parent_id, last_modified,
429 + content_tuple)

And then here "newpath_utf8" is passed to _parse_entry.

Now I realize this is probably caught by "content_tuple[0] == 'deleted'". Though it feels a bit icky to rely on "newpath_utf8" in one portion and "content_tuple[0]" in another (since they can potentially be out of sync.)

I think if we just force an early error with:
if newpath_utf8 == 'None':
newpath = None
...

Overall, this is looking pretty good. A few small tweaks here and there, and possible concerns. But only one major concern.

I don't specifically need to review that change, though. And I don't really want to review this patch yet again :). So I'm voting tweak, and you can submit if you agree with my findings.

119	+ from bzrlib.graph import FrozenHeadsCache
120	+ graph = FrozenHeadsCache(graph)
121	+ new_roots_stream = _new_root_data_stream(
122	+ root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
123	+ return [('texts', new_roots_stream)]

...

392	- elif last_modified[-1] == ':':
393	- raise errors.BzrError('special revisionid found: %r' % line)
394	- if not delta_tree_references and content.startswith('tree\x00'):
395	+ elif newpath_utf8 != 'None' and last_modified[-1] == ':':
396	+ # Deletes have a last_modified of null:, but otherwise special
397	+ # revision ids should not occur.
398	+ raise errors.BzrError('special revisionid found: %r' % line)
399	+ if delta_tree_references is False and content.startswith('tree\x00'):

^- What does "newpath_utf8 != 'None'" mean if someone does:

touch None
bzr add None
bzr commit -m "adding None"

Is this a serialization bug waiting to be exposed?

I guess not as it seems the paths are always prefixed with "/", right?

(So a path of None would actually be "newpath_utf8 == '/None'")

However, further down (sorry about the bad indenting):

413	 if newpath_utf8 == 'None':
414	newpath = None

^- here you set "newpath=None" but you *don't* set "newpath_utf8" to None.

415	+ elif newpath_utf8[:1] != '/':
416	+ raise errors.BzrError(
417	+ "newpath invalid (does not start with /): %r"
418	+ % (newpath_utf8,))
419	else:
420	+ # Trim leading slash
421	+ newpath_utf8 = newpath_utf8[1:]
422	newpath = newpath_utf8.decode('utf8')
423	+ content_tuple = tuple(content.split('\x00'))
424	+ if content_tuple[0] == 'deleted':
425	+ entry = None
426	+ else:
427	+ entry = _parse_entry(
428	+ newpath_utf8, file_id, parent_id, last_modified,
429	+ content_tuple)

And then here "newpath_utf8" is passed to _parse_entry.

I think if we just force an early error with:
if newpath_utf8 == 'None':
  newpath = None
  newpath_utf8 = None

Then if there is an issue at least it fails rather than succeeding to create a new file with the name "None".

if utf8_path.startswith('/'):

^- If this is a core routine (something called for every path) then:
if utf8_path[:1] == '/':

*is* faster than .startswith() because you 
1) Don't have a function call
2) Don't have an attribute lookup

I'm assuming this is a function that gets called a lot. If not, don't worry about it.

...

566	+ if required_version < (1, 18):
567	+ # Remote side doesn't support inventory deltas. Wrap the stream to
568	+ # make sure we don't send any. If the stream contains inventory
569	+ # deltas we'll interrupt the smart insert_stream request and
570	+ # fallback to VFS.
571	+ stream = self._stop_stream_if_inventory_delta(stream)

^- it seems a bit of a shame that if we don't support deltas we fall back to
VFS completely, rather than trying something intermediate (like falling back to
the original code path of sending full inventory texts, or IDS, or...)

I think we are probably okay, but this code at least raises a flag. I expect a
bug report along the lines of "fetching between 1.18 and older server is very
slow". I haven't looked at all the code paths to determine if 1.18 will have
regressed against a 1.17 server. Especially when *not* converting formats. Have
you at least manually tested this?

...

703	+ self.new_pack.finish_content()
704	+ if len(self.packs) == 1:
705	+ old_pack = self.packs[0]
706	+ if old_pack.name == self.new_pack._hash.hexdigest():
707	+ # The single old pack was already optimally packed.
708	+ self.new_pack.abort()
709	+ return None

^- I think this is a good fix, but
1) It really should have a NEWS entry and a test case [it might have a test],
as I'm sure there are people following the bug.
2) It probably should say something about the repository already being
optimally packed. At a minimum a mutter() but I'm thinking it is a reasonable
'note()'.

737	 def _get_source(self, to_format):
738	"""Return a source for streaming from this repository."""
739	- if isinstance(to_format, remote.RemoteRepositoryFormat):
740	- # Can't just check attributes on to_format with the current code,
741	- # work around this:
742	- to_format._ensure_real()
743	- to_format = to_format._custom_format
744	- if to_format.__class__ is self._format.__class__:
745	+ if (to_format.supports_chks and
746	+ self._format.repository_class is to_format.repository_class and
747	+ self._format._serializer == to_format._serializer):
748	# We must be exactly the same format, otherwise stuff like the chk
749	- # page layout might be different
750	+ # page layout might be different.
751	+ # Actually, this test is just slightly looser than exact so that
752	+ # CHK2 <-> 2a transfers will work.

^- I think this should probably be ".network_name() == other.network_name()"
and we just customize the names to be the same. Is that possible to do?

854	+ def iter_inventories(self, revision_ids, ordering='unordered'):

^- the "iter_inventories" api was written such that it always returned the
inventories in the *same* order as the supplied revisions, and we've had code
that did:

for rev, inv in izip(revs, iter_inventories(revs)):
    ...
I've tried to get away from it, because of inv.revision_id, but I think this is
a significant API break that should be documented.

That is the original reason *why* _iter_inventory_xmls() always buffered the
texts before yielding.

There are a few places that you didn't fix, such as:
bzrlib/bundle/serializer/v4.py:369:        for inv in self.repository.iter_inventories(needed_inventories)

^- This, specifically, is one of the places I was working on with my bundles +
2a fixes, and the supplied ordering is *required* (though it happens to also be
topological.)

I'm concerned that we may have more layers that you don't realize built on top
of this. For example:

def revision_trees(self, revision_ids):
        """Return Trees for revisions in this repository.

:param revision_ids: a sequence of revision-ids;
          a revision-id may not be None or 'null:'
        """
        inventories = self.iter_inventories(revision_ids)
        for inv in inventories:
            yield RevisionTree(self, inv, inv.revision_id)

^- And "revision_trees" *also* is explicitly stated that it returns the trees
in the order requested.

And you get stuff like:
bzrlib/repository.py:580:            revtrees = list(self.repository.revision_trees(self.parents))

^- this is a rather serious one in "record_iter_changes" as it really requires
the ordering. We could layer the ordering on top of it, etc.

However, I really think for *your* patch, we shouldn't change the return
ordering of "iter_inventories". We could change it as:

def iter_inventories(self, revisions, ordering=None):
  if ordering is None or ordering == 'as-requested':
      # return inventories in the exact order as requested
      ...

And then for new code that we know we can trust will handle results in
arbitrary order (or make an explicit request on the order), then we can do
better.

It also gives us a way to transition, and mark code that has been fixed. The
parameter could be extended to be passed from "revision_trees()".

1185	+ elif from_format.network_name() == self.to_format.network_name():
1186	+ # Same format.
1187	+ return self._get_simple_inventory_stream(revision_ids,
1188	+ missing=missing)
1189	+ elif (not from_format.supports_chks and not self.to_format.supports_chks
1190	+ and from_format._serializer == self.to_format._serializer):
1191	+ # Essentially the same format.
1192	+ return self._get_simple_inventory_stream(revision_ids,
1193	+ missing=missing)

^- would the second "elif" be better as just an "or ..." on the first elif?

...

1434	+ def _should_fake_unknown(self):
1435	+ # This is a workaround for bugs in pre-1.18 clients that claim to

^- This seems interesting, is it well tested that this is triggered? (And did
you manually use an older bzr and test it with a newer server?)

1660	- result.append((InterRepository,
1661	- weaverepo.RepositoryFormat5(),
1662	- knitrepo.RepositoryFormatKnit3()))
1663	- result.append((InterRepository,
1664	- knitrepo.RepositoryFormatKnit1(),
1665	- knitrepo.RepositoryFormatKnit3()))
1666	- result.append((InterRepository,
1667	- knitrepo.RepositoryFormatKnit1(),
1668	- knitrepo.RepositoryFormatKnit3()))
1669	- result.append((InterKnitRepo,
1670	- knitrepo.RepositoryFormatKnit1(),
1671	- pack_repo.RepositoryFormatKnitPack1()))
1672	- result.append((InterKnitRepo,
1673	- pack_repo.RepositoryFormatKnitPack1(),
1674	- knitrepo.RepositoryFormatKnit1()))
1675	- result.append((InterKnitRepo,
1676	- knitrepo.RepositoryFormatKnit3(),
1677	- pack_repo.RepositoryFormatKnitPack3()))
1678	- result.append((InterKnitRepo,
1679	- pack_repo.RepositoryFormatKnitPack3(),
1680	- knitrepo.RepositoryFormatKnit3()))
1681	- result.append((InterKnitRepo,
1682	- pack_repo.RepositoryFormatKnitPack3(),
1683	- pack_repo.RepositoryFormatKnitPack4()))
1684	- result.append((InterDifferingSerializer,
1685	- pack_repo.RepositoryFormatKnitPack1(),
1686	- pack_repo.RepositoryFormatKnitPack6RichRoot()))
1687	+ add_combo(weaverepo.RepositoryFormat5(),
1688	+ knitrepo.RepositoryFormatKnit3())
1689	+ add_combo(knitrepo.RepositoryFormatKnit1(),
1690	+ knitrepo.RepositoryFormatKnit3())
1691	+ add_combo(knitrepo.RepositoryFormatKnit1(),
1692	+ pack_repo.RepositoryFormatKnitPack1())

^- so aside from some obvious duplication that needed to be removed, this looks
like it removes some test permutations.

Namely, testing that both "InterRepository" *and* "InterKnitRepository" work
between certain repository formats.

I'm a little concerned that we are opening potential holes here. What do you
think?

I can't say that I did a fully thorough review of all the tests. my primary
concern is the change to "iter_inventories" which breaks a bunch of assumptions
from higher up code.

review: Needs Fixing

Revision history for this message

Robert Collins (lifeless) wrote on 2009-08-06:

On Thu, 2009-08-06 at 16:15 +0000, John A Meinel wrote:
>
> if utf8_path.startswith('/'):
>
> ^- If this is a core routine (something called for every path) then:
> if utf8_path[:1] == '/':
>
> *is* faster than .startswith() because you
> 1) Don't have a function call
> 2) Don't have an attribute lookup
>
> I'm assuming this is a function that gets called a lot. If not, don't
> worry about it.

utf8_path[:1] == '/' requires a string copy though, for all that its
heavily tuned in the VM.

> ...
>
> 566 + if required_version < (1, 18):
> 567 + # Remote side doesn't support inventory deltas. Wrap the
> stream to
> 568 + # make sure we don't send any. If the stream contains
> inventory
> 569 + # deltas we'll interrupt the smart insert_stream request and
> 570 + # fallback to VFS.
> 571 + stream = self._stop_stream_if_inventory_delta(stream)
>
> ^- it seems a bit of a shame that if we don't support deltas we fall
> back to
> VFS completely, rather than trying something intermediate (like
> falling back to
> the original code path of sending full inventory texts, or IDS, or...)
>
> I think we are probably okay, but this code at least raises a flag. I
> expect a
> bug report along the lines of "fetching between 1.18 and older server
> is very
> slow". I haven't looked at all the code paths to determine if 1.18
> will have
> regressed against a 1.17 server. Especially when *not* converting
> formats. Have
> you at least manually tested this?

We don't want to require rewindable streams; falling back to VFS is by
far the cleanest way to fallback without restarting the stream or
requiring rewinding. I agree that there is a risk of performance issues,
OTOH launchpad, our largest deployment, will upgrade quickly :).
...
> ^- I think this should probably be ".network_name() ==
> other.network_name()"
> and we just customize the names to be the same. Is that possible to
> do?

It would be a little difficult with clients deployed already that don't
know the names are the same, and further, it would make it impossible to
request initialisation of one of them. In my review I' suggesting just
serializer equality is all thats needed.

Anyhow, my review is about half done, getting back to it ones the mail
surge is complete.

-Rob

On Thu, 2009-08-06 at 16:15 +0000, John A Meinel wrote:
> 
> if utf8_path.startswith('/'):
> 
> ^- If this is a core routine (something called for every path) then:
> if utf8_path[:1] == '/':
> 
> *is* faster than .startswith() because you 
> 1) Don't have a function call
> 2) Don't have an attribute lookup
> 
> I'm assuming this is a function that gets called a lot. If not, don't
> worry about it.

utf8_path[:1] == '/' requires a string copy though, for all that its
heavily tuned in the VM.

> ...
> 
> 566     + if required_version < (1, 18):
> 567     + # Remote side doesn't support inventory deltas. Wrap the
> stream to
> 568     + # make sure we don't send any. If the stream contains
> inventory
> 569     + # deltas we'll interrupt the smart insert_stream request and
> 570     + # fallback to VFS.
> 571     + stream = self._stop_stream_if_inventory_delta(stream)
> 
> ^- it seems a bit of a shame that if we don't support deltas we fall
> back to
> VFS completely, rather than trying something intermediate (like
> falling back to
> the original code path of sending full inventory texts, or IDS, or...)
> 
> I think we are probably okay, but this code at least raises a flag. I
> expect a
> bug report along the lines of "fetching between 1.18 and older server
> is very
> slow". I haven't looked at all the code paths to determine if 1.18
> will have
> regressed against a 1.17 server. Especially when *not* converting
> formats. Have
> you at least manually tested this?

Anyhow, my review is about half done, getting back to it ones the mail
surge is complete.

-Rob

Revision history for this message

Robert Collins (lifeless) wrote on 2009-08-07:

Download full text (23.5 KiB)

On Wed, 2009-08-05 at 02:45 +0000, Andrew Bennetts wrote:
> === modified file 'NEWS'

>
> +* InterDifferingSerializer has been removed. The transformations it
> + provided are now done automatically by StreamSource. (Andrew Bennetts)

^ has it?

> === modified file 'bzrlib/fetch.py'
> --- bzrlib/fetch.py 2009-07-09 08:59:51 +0000
> +++ bzrlib/fetch.py 2009-07-29 07:08:54 +0000

> @@ -249,20 +251,77 @@
> # yet, and are unlikely to in non-rich-root environments anyway.
> root_id_order.sort(key=operator.itemgetter(0))
> # Create a record stream containing the roots to create.
> - def yield_roots():
> - for key in root_id_order:
> - root_id, rev_id = key
> - rev_parents = parent_map[rev_id]
> - # We drop revision parents with different file-ids, because
> - # that represents a rename of the root to a different location
> - # - its not actually a parent for us. (We could look for that
> - # file id in the revision tree at considerably more expense,
> - # but for now this is sufficient (and reconcile will catch and
> - # correct this anyway).
> - # When a parent revision is a ghost, we guess that its root id
> - # was unchanged (rather than trimming it from the parent list).
> - parent_keys = tuple((root_id, parent) for parent in rev_parents
> - if parent != NULL_REVISION and
> - rev_id_to_root_id.get(parent, root_id) == root_id)
> - yield FulltextContentFactory(key, parent_keys, None, '')
> - return [('texts', yield_roots())]
> + from bzrlib.graph import FrozenHeadsCache
> + graph = FrozenHeadsCache(graph)
> + new_roots_stream = _new_root_data_stream(
> + root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
> + return [('texts', new_roots_stream)]
> +

These functions have a lot of parameters. Perhaps that would work better
as state on the Source?

They need docs/more docs respectively.

> +def _new_root_data_stream(
> + root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):
> ...
> +def _parent_keys_for_root_version(
> + root_id, rev_id, rev_id_to_root_id_map, parent_map, repo, graph=None):

On Wed, 2009-08-05 at 02:45 +0000, Andrew Bennetts wrote:
> === modified file 'NEWS'

> 
> +* InterDifferingSerializer has been removed.  The transformations it
> +  provided are now done automatically by StreamSource.  (Andrew Bennetts)

^ has it?

> === modified file 'bzrlib/fetch.py'
> --- bzrlib/fetch.py     2009-07-09 08:59:51 +0000
> +++ bzrlib/fetch.py     2009-07-29 07:08:54 +0000

> @@ -249,20 +251,77 @@
>          # yet, and are unlikely to in non-rich-root environments anyway.
>          root_id_order.sort(key=operator.itemgetter(0))
>          # Create a record stream containing the roots to create.
> -        def yield_roots():
> -            for key in root_id_order:
> -                root_id, rev_id = key
> -                rev_parents = parent_map[rev_id]
> -                # We drop revision parents with different file-ids, because
> -                # that represents a rename of the root to a different location
> -                # - its not actually a parent for us. (We could look for that
> -                # file id in the revision tree at considerably more expense,
> -                # but for now this is sufficient (and reconcile will catch and
> -                # correct this anyway).
> -                # When a parent revision is a ghost, we guess that its root id
> -                # was unchanged (rather than trimming it from the parent list).
> -                parent_keys = tuple((root_id, parent) for parent in rev_parents
> -                    if parent != NULL_REVISION and
> -                        rev_id_to_root_id.get(parent, root_id) == root_id)
> -                yield FulltextContentFactory(key, parent_keys, None, '')
> -        return [('texts', yield_roots())]
> +        from bzrlib.graph import FrozenHeadsCache
> +        graph = FrozenHeadsCache(graph)
> +        new_roots_stream = _new_root_data_stream(
> +            root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
> +        return [('texts', new_roots_stream)]
> +

These functions have a lot of parameters. Perhaps that would work better
as state on the Source?

They need docs/more docs respectively.

> +def _new_root_data_stream(
> +    root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):
> ...
> +def _parent_keys_for_root_version(
> +    root_id, rev_id, rev_id_to_root_id_map, parent_map, repo, graph=None):

> === modified file 'bzrlib/help_topics/en/debug-flags.txt'
> --- bzrlib/help_topics/en/debug-flags.txt       2009-07-24 03:15:56 +0000
> +++ bzrlib/help_topics/en/debug-flags.txt       2009-08-05 02:05:43 +0000
> @@ -12,6 +12,7 @@
>                    operations.
>  -Dfetch           Trace history copying between repositories.
>  -Dfilters         Emit information for debugging content filtering.
> +-Dforceinvdeltas  Force use of inventory deltas during generic streaming fetch.
>  -Dgraph           Trace graph traversal.
>  -Dhashcache       Log every time a working file is read to determine its hash.
>  -Dhooks           Trace hook execution.
> @@ -26,3 +27,7 @@
>  -Dunlock          Some errors during unlock are treated as warnings.
>  -Dpack            Emit information about pack operations.
>  -Dsftp            Trace SFTP internals.
> +-Dstream          Trace fetch streams.
> +-DIDS:never       Never use InterDifferingSerializer when fetching.
> +-DIDS:always      Always use InterDifferingSerializer to fetch if appropriate
> +                  for the format, even for non-local fetches.

I'm a bit uncomfortable with all these debug flags. Are they really
needed? Do we anticipate asking people to use them? If not perhaps we
shouldn't list them, or should use environment variables.

> === modified file 'bzrlib/inventory_delta.py'
> --- bzrlib/inventory_delta.py   2009-04-02 05:53:12 +0000
> +++ bzrlib/inventory_delta.py   2009-08-05 02:30:11 +0000

> 
> -    def __init__(self, versioned_root, tree_references):
> -        """Create an InventoryDeltaSerializer.
> +    def __init__(self):
> +        """Create an InventoryDeltaSerializer."""
> +        self._versioned_root = None
> +        self._tree_references = None
> +        self._entry_to_content = {
> +            'directory': _directory_content,
> +            'file': _file_content,
> +            'symlink': _link_content,
> +        }
> +
> +    def require_flags(self, versioned_root=None, tree_references=None):
> +        """Set the versioned_root and/or tree_references flags for this
> +        (de)serializer.

^ why is this not in the constructor? You make the fields settable only
once, which seems identical to being set from __init__, but harder to
use. As its required to be called, it should be documented in the class
or __init__ docstring or something like that.

> @@ -221,10 +255,18 @@
>      def parse_text_bytes(self, bytes):
>          """Parse the text bytes of a serialized inventory delta.
>  
> +        If versioned_root and/or tree_references flags were set via
> +        require_flags, then the parsed flags must match or a BzrError will be
> +        raised.
> +
...

the changes to this function look like they do what they are intended to
do, but I don't understand why thats desirable or necessary: we always
know whether we want rich/nonrich etc.

> === modified file 'bzrlib/remote.py'
> --- bzrlib/remote.py    2009-07-30 04:27:05 +0000
> +++ bzrlib/remote.py    2009-08-05 00:07:14 +0000
> @@ -426,6 +426,7 @@
>          self._custom_format = None
>          self._network_name = None
>          self._creating_bzrdir = None
> +        self._supports_chks = None
>          self._supports_external_lookups = None
>          self._supports_tree_reference = None
>          self._rich_root_data = None
> @@ -443,6 +444,13 @@
>          return self._rich_root_data
>  
>      @property
> +    def supports_chks(self):
> +        if self._supports_chks is None:
> +            self._ensure_real()
> +            self._supports_chks = self._custom_format.supports_chks
> +        return self._supports_chks
> +
> +    @property
>      def supports_external_lookups(self):
>          if self._supports_external_lookups is None:
>              self._ensure_real()
> @@ -579,6 +587,11 @@
>          self._ensure_real()
>          return self._custom_format._serializer
>  
> +    @property
> +    def repository_class(self):
> +        self._ensure_real()
> +        return self._custom_format.repository_class
> +

^ this property sets off alarm bells for me. What its for?

>  class RemoteStreamSink(repository.StreamSink):
>  
> +    def __init__(self, target_repo):
> +        repository.StreamSink.__init__(self, target_repo)
> +

^ this isn't needed, as its doing nothing more than upcall.

>          path = target.bzrdir._path_for_remote_call(client)
> -        if not resume_tokens:
> -            # XXX: Ugly but important for correctness, *will* be fixed during
> -            # 1.13 cycle. Pushing a stream that is interrupted results in a
> -            # fallback to the _real_repositories sink *with a partial stream*.
> -            # Thats bad because we insert less data than bzr expected. To avoid
> -            # this we do a trial push to make sure the verb is accessible, and
> -            # do not fallback when actually pushing the stream. A cleanup patch
> -            # is going to look at rewinding/restarting the stream/partial
> -            # buffering etc.

^-- this isn't fixed, is it?

>  class RemoteStreamSource(repository.StreamSource):
>      """Stream data from a remote server."""
> @@ -1745,6 +1815,12 @@
>              sources.append(repo)
>          return self.missing_parents_chain(search, sources)
>  
> +    def get_stream_for_missing_keys(self, missing_keys):
> +        self.from_repository._ensure_real()
> +        real_repo = self.from_repository._real_repository
> +        real_source = real_repo._get_source(self.to_format)
> +        return real_source.get_stream_for_missing_keys(missing_keys)

^ this fixes bug 406686. I don't think you mention that though inNEWS
etc.

> === modified file 'bzrlib/repofmt/groupcompress_repo.py'
> --- bzrlib/repofmt/groupcompress_repo.py        2009-07-14 17:33:13 +0000
> +++ bzrlib/repofmt/groupcompress_repo.py        2009-07-22 03:38:51 +0000
> @@ -154,6 +154,8 @@
>          self._writer.begin()
>          # what state is the pack in? (open, finished, aborted)
>          self._state = 'open'
> +        # no name until we finish writing the content
> +        self.name = None
>  
>      def _check_references(self):
>          """Make sure our external references are present.
> @@ -466,6 +468,13 @@
>          if not self._use_pack(self.new_pack):
>              self.new_pack.abort()
>              return None
> +        self.new_pack.finish_content()
> +        if len(self.packs) == 1:
> +            old_pack = self.packs[0]
> +             old_pack.name == self.new_pack._hash.hexdigest():
> +                # The single old pack was already optimally packed.
> +                self.new_pack.abort()
> +                return None

^ This isn't quite right. It solves the case of 'bzr pack; bzr
pack' (and there is a bug open on that too). However it won't fix the
case where a single new pack is streamed, cross format, and also happens
to be optimally packed in the outcome. So, please file a bug on this
additional case.

> @@ -879,14 +888,13 @@
>  
>      def _get_source(self, to_format):
>          """Return a source for streaming from this repository."""
> -        if to_format.__class__ is self._format.__class__:
> +        if (to_format.supports_chks and
> +            self._format.repository_class is to_format.repository_class and
> +            self._format._serializer == to_format._serializer):

^- this change is wrong; repo format classes have constant serializers -
the check is overkill, and requiring a format to have a constant
repository class, which if we'd needed we would have just passed that in
instead. Or to say it differently, either pass the repo class in, or
don't test this.

>              # We must be exactly the same format, otherwise stuff like the chk
> -            # page layout might be different
> +            # page layout might be different.
> +            # Actually, this test is just slightly looser than exact so that
> +            # CHK2 <-> 2a transfers will work.
>              return GroupCHKStreamSource(self, to_format)

CHK2 <-> 2a have the same serializer. The test should just look at
serializer I suspect.

> === modified file 'bzrlib/repofmt/pack_repo.py'
> --- bzrlib/repofmt/pack_repo.py 2009-07-01 10:42:14 +0000
> +++ bzrlib/repofmt/pack_repo.py 2009-07-16 02:10:20 +0000

> @@ -1567,7 +1574,7 @@
>          # determine which packs need changing
>          pack_operations = [[0, []]]
>          for pack in self.all_packs():
> -            if not hint or pack.name in hint:
> +            if hint is None or pack.name in hint:

^- the original form is actually faster, AFAIK. because it skips the in
test for a hint of []. I'd rather we didn't change it, for all that its
not in a common code path.

> @@ -2093,6 +2100,7 @@
>                  # when autopack takes no steps, the names list is still
>                  # unsaved.
>                  return self._save_pack_names()
> +        return []

^- is this correctness or changing the signature? If its the latter,
update the docstring perhaps?

> === modified file 'bzrlib/repository.py'
> --- bzrlib/repository.py        2009-08-04 16:20:05 +0000
> +++ bzrlib/repository.py        2009-08-05 02:37:11 +0000
> @@ -924,6 +925,11 @@
>          """
>          if self._write_group is not self.get_transaction():
>              # has an unlock or relock occured ?
> +            if suppress_errors:
> +                mutter(
> +                '(suppressed) mismatched lock context and write group. %r, %r',
> +                self._write_group, self.get_transaction())
> +                return

^- is suppress_errors defined globally?

> +    def _get_convertable_inventory_stream(self, revision_ids,
> +                                          delta_versus_null=False):
> +        # The source is using CHKs, but the target either doesn't or is has a
> +        # different serializer.  The StreamSink code expects to be able to
> +        # convert on the target, so we need to put bytes-on-the-wire that can
> +        # be converted.  That means inventory deltas (if the remote is <1.18,
> +        # RemoteStreamSink will fallback to VFS to insert the deltas).

^- broken english 'or is has' In fact, reread the entire comment ;).

> === modified file 'bzrlib/smart/repository.py'
> --- bzrlib/smart/repository.py  2009-06-16 06:46:32 +0000
> +++ bzrlib/smart/repository.py  2009-08-04 00:51:24 +0000
> @@ -414,8 +418,39 @@
>              repository.
>          """
>          self._to_format = network_format_registry.get(to_network_name)
> +        if self._should_fake_unknown():
> +            return FailedSmartServerResponse(
> +                ('UnknownMethod', 'Repository.get_stream'))
>          return None # Signal that we want a body.
>  
> +    def _should_fake_unknown(self):

^- the long comment here could well be helped by a docstring. I think
the method name could be improved too - perhaps self._support_method(),
or self._permit_method(). Returning UnknownMethod is a special way to
support it after all.

> 
>  
> +class SmartServerRepositoryInsertStream_1_18(SmartServerRepositoryInsertStreamLocked):
> +    """Insert a record stream from a RemoteSink into a repository.
> +
> +    Same as SmartServerRepositoryInsertStreamLocked, except:
> +     - the lock token argument is optional
> +     - servers that implement this verb accept 'inventory-delta' records in the
> +       stream.
> +
> +    New in 1.18.
> +    """
> +
> +    def do_repository_request(self, repository, resume_tokens, lock_token=None):
> +        """StreamSink.insert_stream for a remote repository."""
> +        SmartServerRepositoryInsertStreamLocked.do_repository_request(
> +            self, repository, resume_tokens, lock_token)

^ this is really odd -- you upcall, unconditionally, and do nothing
else. Is the reason to just make lock_token optional? If so, why not
just change the signature on
SmartServerRepositoryInsertStreamLocked.do_repository_request? Secondly,
the base class will accept deltas too, AFAICT. So really,its that
calling the new registered verb is a way to be sure that deltas will be
accepted. I'd be happier with just:
register SSRISL twice in the registry, and document the
improvements/easements. Note that we will/should be locking always (at
the api layer, though not necessarily the network) for compatibility
with knits - its why we have SSRISLocked anyway. So the change to make
lock_token optional seems unneeded.

> === modified file 'bzrlib/tests/__init__.py'
> --- bzrlib/tests/__init__.py    2009-08-04 11:40:59 +0000
> +++ bzrlib/tests/__init__.py    2009-08-05 02:37:11 +0000
> @@ -1938,6 +1938,16 @@
>          sio.encoding = output_encoding
>          return sio
>  
> +    def disable_verb(self, verb):
> +        """Disable a smart server verb for one test."""
> +        from bzrlib.smart import request
> +        request_handlers = request.request_handlers
> +        orig_method = request_handlers.get(verb)
> +        request_handlers.remove(verb)
> +        def restoreVerb():
> +            request_handlers.register(verb, orig_method)
> +        self.addCleanup(restoreVerb)

[aside] It would be nice to find a better management pattern for these helpers.

The blackbox and unit test_push changes are identical; may be ripe for
refactoring to reduce duplication there.

> === modified file 'bzrlib/tests/per_interrepository/__init__.py'
> --- bzrlib/tests/per_interrepository/__init__.py        2009-07-10 06:46:10 +0000
> +++ bzrlib/tests/per_interrepository/__init__.py        2009-07-16 02:13:41 +0000
> @@ -32,8 +32,6 @@
>      )
>  
>  from bzrlib.repository import (
> -    InterDifferingSerializer,
> -    InterKnitRepo,
>      InterRepository,
>      )
>  from bzrlib.tests import (
> @@ -51,15 +49,13 @@
>          (interrepo_class, repository_format, repository_format_to).
>      """
>      result = []
> -    for interrepo_class, repository_format, repository_format_to in formats:
> -        id = '%s,%s,%s' % (interrepo_class.__name__,
> -                            repository_format.__class__.__name__,
> -                            repository_format_to.__class__.__name__)
> +    for repository_format, repository_format_to in formats:
> +        id = '%s,%s' % (repository_format.__class__.__name__,
> +                        repository_format_to.__class__.__name__)

^- This means that we can't specify a specific interrepo type in test
selection. I'm strongly against losing that facility.

>          scenario = (id,
>              {"transport_server": transport_server,
>               "transport_readonly_server": transport_readonly_server,
>               "repository_format": repository_format,
> -             "interrepo_class": interrepo_class,
>               "repository_format_to": repository_format_to,
>               })
>          result.append(scenario)
> @@ -68,8 +64,12 @@
>  
>  def default_test_list():
>      """Generate the default list of interrepo permutations to test."""
> -    from bzrlib.repofmt import knitrepo, pack_repo, weaverepo
> +    from bzrlib.repofmt import (
> +        knitrepo, pack_repo, weaverepo, groupcompress_repo,
> +        )
>      result = []
> +    def add_combo(from_format, to_format):
> +        result.append((from_format, to_format))
>      # test the default InterRepository between format 6 and the current
>      # default format.
>      # XXX: robertc 20060220 reinstate this when there are two supported
> @@ -80,37 +80,33 @@
>      for optimiser_class in InterRepository._optimisers:
>          format_to_test = optimiser_class._get_repo_format_to_test()
>          if format_to_test is not None:
> -            result.append((optimiser_class,
> -                           format_to_test, format_to_test))
> +            add_combo(format_to_test, format_to_test)
>      # if there are specific combinations we want to use, we can add them
>      # here. We want to test rich root upgrading.
> -    result.append((InterRepository,
> -                   weaverepo.RepositoryFormat5(),
> -                   knitrepo.RepositoryFormatKnit3()))
> -    result.append((InterRepository,
> -                   knitrepo.RepositoryFormatKnit1(),
> -                   knitrepo.RepositoryFormatKnit3()))
> -    result.append((InterRepository,
> -                   knitrepo.RepositoryFormatKnit1(),
> -                   knitrepo.RepositoryFormatKnit3()))
> -    result.append((InterKnitRepo,
> -                   knitrepo.RepositoryFormatKnit1(),
> -                   pack_repo.RepositoryFormatKnitPack1()))
> -    result.append((InterKnitRepo,
> -                   pack_repo.RepositoryFormatKnitPack1(),
> -                   knitrepo.RepositoryFormatKnit1()))
> -    result.append((InterKnitRepo,
> -                   knitrepo.RepositoryFormatKnit3(),
> -                   pack_repo.RepositoryFormatKnitPack3()))
> -    result.append((InterKnitRepo,
> -                   pack_repo.RepositoryFormatKnitPack3(),
> -                   knitrepo.RepositoryFormatKnit3()))
> -    result.append((InterKnitRepo,
> -                   pack_repo.RepositoryFormatKnitPack3(),
> -                   pack_repo.RepositoryFormatKnitPack4()))
> -    result.append((InterDifferingSerializer,
> -                   pack_repo.RepositoryFormatKnitPack1(),
> -                   pack_repo.RepositoryFormatKnitPack6RichRoot()))
> +    add_combo(weaverepo.RepositoryFormat5(),
> +              knitrepo.RepositoryFormatKnit3())
> +    add_combo(knitrepo.RepositoryFormatKnit1(),
> +              knitrepo.RepositoryFormatKnit3())
> +    add_combo(knitrepo.RepositoryFormatKnit1(),
> +              pack_repo.RepositoryFormatKnitPack1())
> +    add_combo(pack_repo.RepositoryFormatKnitPack1(),
> +              knitrepo.RepositoryFormatKnit1())
> +    add_combo(knitrepo.RepositoryFormatKnit3(),
> +              pack_repo.RepositoryFormatKnitPack3())
> +    add_combo(pack_repo.RepositoryFormatKnitPack3(),
> +              knitrepo.RepositoryFormatKnit3())
> +    add_combo(pack_repo.RepositoryFormatKnitPack3(),
> +              pack_repo.RepositoryFormatKnitPack4())
> +    add_combo(pack_repo.RepositoryFormatKnitPack1(),
> +              pack_repo.RepositoryFormatKnitPack6RichRoot())
> +    add_combo(pack_repo.RepositoryFormatKnitPack6RichRoot(),
> +              groupcompress_repo.RepositoryFormat2a())
> +    add_combo(groupcompress_repo.RepositoryFormat2a(),
> +              pack_repo.RepositoryFormatKnitPack6RichRoot())
> +    add_combo(groupcompress_repo.RepositoryFormatCHK2(),
> +              groupcompress_repo.RepositoryFormat2a())
> +    add_combo(groupcompress_repo.RepositoryFormatCHK1(),
> +              groupcompress_repo.RepositoryFormat2a())

And I share John's concern about losing coverage with the above. We were
testing the same helper with different types before, and we're not able
to do that now. This doesn't seem coupled to your actual intent, perhaps
it would be better to just backout this change.

> 
> 
> +        if not unstacked_repo._format.supports_chks:
> +            # these assertions aren't valid for groupcompress repos, which may
> +            # transfer data than strictly necessary to avoid breaking up an
> +            # already-compressed block of data.
> +            self.assertFalse(unstacked_repo.has_revision('left'))
> +            self.assertFalse(unstacked_repo.has_revision('right'))

^ please check the comment, its not quite clear.

> +    def assertCanStreamRevision(self, repo, revision_id):
> +        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
> +        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
> +        source = repo._get_source(repo._format)
> +        for substream_kind, substream in source.get_stream(search):
> +            # Consume the substream
> +            list(substream)

^ - this assertion, while named nice, isn't strong enough for me. I'd
like it to assert that either a delta matching the local changes, or a
full inventory, are streamed - and that the newly referenced file texts
are present.

> === modified file 'bzrlib/tests/test_inventory_delta.py'
> --- bzrlib/tests/test_inventory_delta.py        2009-04-02 05:53:12 +0000
> +++ bzrlib/tests/test_inventory_delta.py        2009-08-05 02:30:11 +0000
> ...

The changes in this test script match what the inventory_delta change -
but as I'm not convinced by that change they feel somewhat awkard - and
incomplete, as there are more dimensions for the object to vary in now.

> === modified file 'bzrlib/tests/test_remote.py'
> --- bzrlib/tests/test_remote.py 2009-07-30 04:27:05 +0000
> +++ bzrlib/tests/test_remote.py 2009-08-05 01:41:03 +0000
> 
> +    """Tests for using Repository.insert_stream verb when the _1.18 variant is
> +    not available.
> +
> +    This test case is very similar to TestRepositoryInsertStream_1_18.
> +    """

Perhaps a base class would make this smaller - it is nearly identical.
> 
> +    def test_stream_with_inventory_deltas(self):
> +        """'inventory-deltas' substreams can't be sent to the
> +        Repository.insert_stream verb.  So when one is encountered the
> +        RemoteSink immediately stops using that verb and falls back to VFS
> +        insert_stream.
> +        """

^this comment isn't strictly true. We *can* send them, to servers
running 1.18. But not all servers will *accept* them, and thats why we
want to be sure we *won't* send them. I appreciate this is nitpicking,
but we're encoding knowledge about the system for later developers, and
this is worth being clear on, I think.

I"m not sure how to say 'needs fixing' in review mail, but thats what
I'd say.

Its close though.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-07:

Download full text (12.6 KiB)

John A Meinel wrote:
> Review: Needs Fixing
> Overall, this is looking pretty good. A few small tweaks here and there, and
> possible concerns. But only one major concern.

That's great news. Thanks very much for the thorough review.

> You changed the default return value of "iter_inventories()" to return
> 'unordered' results, which means that "revision_trees()" also not returns
> 'unordered' results. Which is fairly serious, and I was able to find more than
> one code path that was relying on the ordering of revision_trees(). So I think
> the idea is sound, but it needs to be exposed in a backwards compatible manner
> by making it a flag that defaults to "as-requested" ordering, and then we fix
> up code paths as we can.
>
> I don't specifically need to review that change, though. And I don't really
> want to review this patch yet again :). So I'm voting tweak, and you can
> submit if you agree with my findings.

Ouch, yes, that does sound serious. I'll fix that.

> 119 + from bzrlib.graph import FrozenHeadsCache
> 120 + graph = FrozenHeadsCache(graph)
> 121 + new_roots_stream = _new_root_data_stream(
> 122 + root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
> 123 + return [('texts', new_roots_stream)]
>
> ^- I'm pretty sure if we are using FrozenHeadsCache that we really want to use
> KnownGraph instead. We have a dict 'parent_map' which means we can do much
> more efficient heads() checks since the whole graph is already loaded. This is
> a minor thing we can do later, but it would probably be good not to forget.

That's a good idea, I'll try that. Perhaps we should just grep the entire code
base for FrozenHeadsCache and replace all uses with KnownGraph...

> ...
>
> 392 - elif last_modified[-1] == ':':
> 393 - raise errors.BzrError('special revisionid found: %r' % line)
> 394 - if not delta_tree_references and content.startswith('tree\x00'):
> 395 + elif newpath_utf8 != 'None' and last_modified[-1] == ':':
> 396 + # Deletes have a last_modified of null:, but otherwise special
> 397 + # revision ids should not occur.
> 398 + raise errors.BzrError('special revisionid found: %r' % line)
> 399 + if delta_tree_references is False and content.startswith('tree\x00'):
>
> ^- What does "newpath_utf8 != 'None'" mean if someone does:
>
> touch None
> bzr add None
> bzr commit -m "adding None"
>
> Is this a serialization bug waiting to be exposed?
>
> I guess not as it seems the paths are always prefixed with "/", right?
>
> (So a path of None would actually be "newpath_utf8 == '/None'")

That's right. No bug here.

> However, further down (sorry about the bad indenting):
>
> 413 if newpath_utf8 == 'None':
> 414 newpath = None
>
> ^- here you set "newpath=None" but you *don't* set "newpath_utf8" to None.
[...]
> And then here "newpath_utf8" is passed to _parse_entry.
>
> Now I realize this is probably caught by "content_tuple[0] == 'deleted'".
> Though it feels a bit icky to rely on "newpath_utf8" in one portion and
> "content_tuple[0]" in another (since they can potentially be out of sync.)

Yes. There's a bit of ickiness here too that _parse_entry redoes the utf8
decoding. I'll clean this up.

> I think if we jus...

John A Meinel wrote:
> Review: Needs Fixing
> Overall, this is looking pretty good. A few small tweaks here and there, and
> possible concerns. But only one major concern.

That's great news.  Thanks very much for the thorough review.

> You changed the default return value of "iter_inventories()" to return
> 'unordered' results, which means that "revision_trees()" also not returns
> 'unordered' results. Which is fairly serious, and I was able to find more than
> one code path that was relying on the ordering of revision_trees(). So I think
> the idea is sound, but it needs to be exposed in a backwards compatible manner
> by making it a flag that defaults to "as-requested" ordering, and then we fix
> up code paths as we can.
> 
> I don't specifically need to review that change, though. And I don't really
> want to review this patch yet again :). So I'm voting tweak, and you can
> submit if you agree with my findings.

Ouch, yes, that does sound serious.  I'll fix that.

> 119	+ from bzrlib.graph import FrozenHeadsCache
> 120	+ graph = FrozenHeadsCache(graph)
> 121	+ new_roots_stream = _new_root_data_stream(
> 122	+ root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
> 123	+ return [('texts', new_roots_stream)]
> 
> ^- I'm pretty sure if we are using FrozenHeadsCache that we really want to use
> KnownGraph instead. We have a dict 'parent_map' which means we can do much
> more efficient heads() checks since the whole graph is already loaded. This is
> a minor thing we can do later, but it would probably be good not to forget.

That's a good idea, I'll try that.  Perhaps we should just grep the entire code
base for FrozenHeadsCache and replace all uses with KnownGraph...

> ...
> 
> 392	- elif last_modified[-1] == ':':
> 393	- raise errors.BzrError('special revisionid found: %r' % line)
> 394	- if not delta_tree_references and content.startswith('tree\x00'):
> 395	+ elif newpath_utf8 != 'None' and last_modified[-1] == ':':
> 396	+ # Deletes have a last_modified of null:, but otherwise special
> 397	+ # revision ids should not occur.
> 398	+ raise errors.BzrError('special revisionid found: %r' % line)
> 399	+ if delta_tree_references is False and content.startswith('tree\x00'):
> 
> ^- What does "newpath_utf8 != 'None'" mean if someone does:
> 
> touch None
> bzr add None
> bzr commit -m "adding None"
> 
> Is this a serialization bug waiting to be exposed?
> 
> I guess not as it seems the paths are always prefixed with "/", right?
> 
> (So a path of None would actually be "newpath_utf8 == '/None'")

That's right.  No bug here.

> However, further down (sorry about the bad indenting):
> 
> 413	 if newpath_utf8 == 'None':
> 414	newpath = None
> 
> ^- here you set "newpath=None" but you *don't* set "newpath_utf8" to None.
[...]
> And then here "newpath_utf8" is passed to _parse_entry.
> 
> Now I realize this is probably caught by "content_tuple[0] == 'deleted'".
> Though it feels a bit icky to rely on "newpath_utf8" in one portion and
> "content_tuple[0]" in another (since they can potentially be out of sync.)

Yes.  There's a bit of ickiness here too that _parse_entry redoes the utf8
decoding.  I'll clean this up.

> I think if we just force an early error with:
> if newpath_utf8 == 'None':
>   newpath = None
>   newpath_utf8 = None
> 
> Then if there is an issue at least it fails rather than succeeding to create a
> new file with the name "None".

The idea I think is that after this section newpath_utf8 and oldpath_utf8 aren't
needed anymore because we've either set newpath and oldpath, or we've raised an
error.  _parse_entry ought to just receive newpath, rather than re-decoding
newpath_utf8, and I've made that change.

I can insert a del newpath_utf8, oldpath_utf8 if you think that would make that
clearer?  Assigning to None is similar but less explicit, and either way it
should be done unconditionally at that point in the loop, not just for the
'None' case.

> if utf8_path.startswith('/'):
> 
> ^- If this is a core routine (something called for every path) then:
> if utf8_path[:1] == '/':
> 
> *is* faster than .startswith() because you 
> 1) Don't have a function call
> 2) Don't have an attribute lookup
> 
> I'm assuming this is a function that gets called a lot. If not, don't worry about it.

It is called for every path, but it's barely even registering in lsprof.  Just
0.40% of _extract_and_insert_inventories is spent in parse_text_bytes.
startswith accounts for 3.5% of that 0.40%, which explains why it was so hard to
find in kcachegrind :)

To be fair, this was a local push, so _stream_invs_as_deltas from the source was
being largely counted under _extract_and_insert_inventories, but it's still a
pretty miniscule amount of time.

I don't think micro-optimisations are going to make a great deal of difference
here, at least not until we remove some other bottlenecks.

(Incidentally, fixing the duplicated utf8 decode of newpath_utf8 will probably
cut ~8% of the time of parse_text_bytes according to kcachegrind, which is
already much larger than the potential benefit of replacing startswith.)

> ...
> 
> 566	+ if required_version < (1, 18):
> 567	+ # Remote side doesn't support inventory deltas. Wrap the stream to
> 568	+ # make sure we don't send any. If the stream contains inventory
> 569	+ # deltas we'll interrupt the smart insert_stream request and
> 570	+ # fallback to VFS.
> 571	+ stream = self._stop_stream_if_inventory_delta(stream)
> 
> ^- it seems a bit of a shame that if we don't support deltas we fall back to
> VFS completely, rather than trying something intermediate (like falling back to
> the original code path of sending full inventory texts, or IDS, or...)
> 
> I think we are probably okay, but this code at least raises a flag. I expect a
> bug report along the lines of "fetching between 1.18 and older server is very
> slow". I haven't looked at all the code paths to determine if 1.18 will have
> regressed against a 1.17 server. Especially when *not* converting formats. Have
> you at least manually tested this?

Fetching *cross-format* over the network is already extremely slow, due to IDS.
This isn't optimal, but should be no worse (and possibly better) than IDS.

If it isn't a cross-format fetch, then there will be no inventory-deltas stream.

So I'm pretty sure there's no regression here.

Also, as Robert points out, the alternatives are much more problematic; the sink
doesn't have any way to rewind or restart the stream it's given.  We can't even
up-cast inventory-deltas to inventory fulltexts because some formats (e.g. 2a in
1.17) don't have a usable inventory fulltext format (because it tries to use
xml5 to deserialise them).

> ...
> 
> 703	+ self.new_pack.finish_content()
> 704	+ if len(self.packs) == 1:
> 705	+ old_pack = self.packs[0]
> 706	+ if old_pack.name == self.new_pack._hash.hexdigest():
> 707	+ # The single old pack was already optimally packed.
> 708	+ self.new_pack.abort()
> 709	+ return None
> 
> ^- I think this is a good fix, but
> 1) It really should have a NEWS entry and a test case [it might have a test],
> as I'm sure there are people following the bug.

I'll add a NEWS entry.  I don't think this fixes the bug report I know of though
(perhaps one case of it?).

To be honest, I'm not sure what the right test to write here is.  I agree it
would be good to have an explicit test for this, but I'm not sure where to add
it, or how to provoke this condition in a succint way (and without grossly
violating the layering).

Tests were failing without this fix, though :)

> 2) It probably should say something about the repository already being
> optimally packed. At a minimum a mutter() but I'm thinking it is a reasonable
> 'note()'.

I'm not confident that it's a good note(), so I'll err on the side of a
mutter().  I'm pretty sure I saw this come up in tests during autopacks after a
fetch, when the user didn't explicitly ask for a pack... ah, here's a relevant
part of the conversation I had with Robert about it:

<lifeless> I suspect the following conditions are occuring:
<lifeless>  - you upload a single pack
<lifeless>  - the pack is sufficiently simple that the sort order for its contents are the same as the upload order happened to generate
<lifeless>  - the group splitting heuristics happen to line up at the same boundaries
<lifeless> -> collision, with the same content.
<lifeless> so this is natural.

So, I don't think we should emit a note() in that circumstance.

> 737	 def _get_source(self, to_format):
[...]
> 751	+ # Actually, this test is just slightly looser than exact so that
> 752	+ # CHK2 <-> 2a transfers will work.
> 
> ^- I think this should probably be ".network_name() == other.network_name()"
> and we just customize the names to be the same. Is that possible to do?

Robert already answered this.

> 854	+ def iter_inventories(self, revision_ids, ordering='unordered'):
> 
> ^- the "iter_inventories" api was written such that it always returned the
> inventories in the *same* order as the supplied revisions, and we've had code
> that did:
> 
> for rev, inv in izip(revs, iter_inventories(revs)):
>     ...
> I've tried to get away from it, because of inv.revision_id, but I think this is
> a significant API break that should be documented.
> 
> That is the original reason *why* _iter_inventory_xmls() always buffered the
> texts before yielding.

...and it still does, although it now it buffers only as is necessary to give
results in order, rather than the full result.  See the “while next_key in
text_chunks” loop, next_key comes from iter(keys), i.e. that loop yields the
results in the same order as they were passed in.  I've added a comment to
clarify.

So, actually, this change isn't having the effect I intended.  So thanks for
highlighting that!  Hmm.

The idea was that _stream_invs_as_deltas wants to generate the deltas in
topological order so that the sink can always insert them without needing to
buffer them.  Given that it was always returning them in as-requested order, I
wonder why that never caused any trouble?  The revisions list passed in there is
ordered in python dict.keys() order, I think!  I've fixed it anyway.

[...]
> However, I really think for *your* patch, we shouldn't change the return
> ordering of "iter_inventories". We could change it as:
> 
> def iter_inventories(self, revisions, ordering=None):
>   if ordering is None or ordering == 'as-requested':
>       # return inventories in the exact order as requested
>       ...

Right.  I've made a change along these lines, although just ordering=None
(having 'as-requested' be a second spelling for None/omitting the arg doesn't
seem much clearer to me).

> 1185	+ elif from_format.network_name() == self.to_format.network_name():
> 1186	+ # Same format.
> 1187	+ return self._get_simple_inventory_stream(revision_ids,
> 1188	+ missing=missing)
> 1189	+ elif (not from_format.supports_chks and not self.to_format.supports_chks
> 1190	+ and from_format._serializer == self.to_format._serializer):
> 1191	+ # Essentially the same format.
> 1192	+ return self._get_simple_inventory_stream(revision_ids,
> 1193	+ missing=missing)
> 
> ^- would the second "elif" be better as just an "or ..." on the first elif?

Personally I find multi-line if conditions to be pretty hard to untangle when
reading.  So I picked this as the lesser evil in this instance.  I'm sure
individual tastes vary, though.

> 1434	+ def _should_fake_unknown(self):
> 1435	+ # This is a workaround for bugs in pre-1.18 clients that claim to
> 
> ^- This seems interesting, is it well tested that this is triggered? (And did
> you manually use an older bzr and test it with a newer server?)

I'm pretty sure I tested this manually.  I'll double-check.  I'll also add
something to test_smart.

> 
> 1660	- result.append((InterRepository,
> 1661	- weaverepo.RepositoryFormat5(),
> 1662	- knitrepo.RepositoryFormatKnit3()))
[...]
> 1691	+ add_combo(knitrepo.RepositoryFormatKnit1(),
> 1692	+ pack_repo.RepositoryFormatKnitPack1())
> 
> 
> ^- so aside from some obvious duplication that needed to be removed, this looks
> like it removes some test permutations.
> 
> Namely, testing that both "InterRepository" *and* "InterKnitRepository" work
> between certain repository formats.
> 
> I'm a little concerned that we are opening potential holes here. What do you
> think?

Hah.  It *looks* like that, but actually nothing was using the
interrepository_class at all (except a test for the test paramaterisation
helpers!).  So actually I've not only removed duplication, I've removed the
impression that there's more coverage here than there actually is.

> I can't say that I did a fully thorough review of all the tests. my primary
> concern is the change to "iter_inventories" which breaks a bunch of assumptions
> from higher up code.

Thanks very much for the review.

-Andrew.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-07:

Download full text (24.0 KiB)

Robert Collins wrote:
> On Wed, 2009-08-05 at 02:45 +0000, Andrew Bennetts wrote:
> > === modified file 'NEWS'
>
> >
> > +* InterDifferingSerializer has been removed. The transformations it
> > + provided are now done automatically by StreamSource. (Andrew Bennetts)
>
> ^ has it?

Oops. Updated:

* InterDifferingSerializer is now only used locally. Other fetches that
  would have used InterDifferingSerializer now use the more network
  friendly StreamSource, which now automatically does the same
  transformations as InterDifferingSerializer. (Andrew Bennetts)

> > === modified file 'bzrlib/fetch.py'
> > --- bzrlib/fetch.py 2009-07-09 08:59:51 +0000
> > +++ bzrlib/fetch.py 2009-07-29 07:08:54 +0000
>
> > @@ -249,20 +251,77 @@
> > # yet, and are unlikely to in non-rich-root environments anyway.
> > root_id_order.sort(key=operator.itemgetter(0))
> > # Create a record stream containing the roots to create.
> > - def yield_roots():
> > - for key in root_id_order:
> > - root_id, rev_id = key
> > - rev_parents = parent_map[rev_id]
> > - # We drop revision parents with different file-ids, because
> > - # that represents a rename of the root to a different location
> > - # - its not actually a parent for us. (We could look for that
> > - # file id in the revision tree at considerably more expense,
> > - # but for now this is sufficient (and reconcile will catch and
> > - # correct this anyway).
> > - # When a parent revision is a ghost, we guess that its root id
> > - # was unchanged (rather than trimming it from the parent list).
> > - parent_keys = tuple((root_id, parent) for parent in rev_parents
> > - if parent != NULL_REVISION and
> > - rev_id_to_root_id.get(parent, root_id) == root_id)
> > - yield FulltextContentFactory(key, parent_keys, None, '')
> > - return [('texts', yield_roots())]
> > + from bzrlib.graph import FrozenHeadsCache
> > + graph = FrozenHeadsCache(graph)
> > + new_roots_stream = _new_root_data_stream(
> > + root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
> > + return [('texts', new_roots_stream)]
> > +
>
> These functions have a lot of parameters. Perhaps that would work better
> as state on the Source?

Perhaps, although InterDifferingSerializer uses it too (that's where I
originally extracted this code from). Something to look at again when we remove
IDS, I think.

> They need docs/more docs respectively.

The methods they were refactored from didn't have docs either :P

Docs extended.

Robert Collins wrote:
> On Wed, 2009-08-05 at 02:45 +0000, Andrew Bennetts wrote:
> > === modified file 'NEWS'
> 
> > 
> > +* InterDifferingSerializer has been removed.  The transformations it
> > +  provided are now done automatically by StreamSource.  (Andrew Bennetts)
> 
> ^ has it?

Oops.  Updated:

* InterDifferingSerializer is now only used locally.  Other fetches that
  would have used InterDifferingSerializer now use the more network
  friendly StreamSource, which now automatically does the same
  transformations as InterDifferingSerializer.  (Andrew Bennetts)

> > === modified file 'bzrlib/fetch.py'
> > --- bzrlib/fetch.py     2009-07-09 08:59:51 +0000
> > +++ bzrlib/fetch.py     2009-07-29 07:08:54 +0000
> 
> > @@ -249,20 +251,77 @@
> >          # yet, and are unlikely to in non-rich-root environments anyway.
> >          root_id_order.sort(key=operator.itemgetter(0))
> >          # Create a record stream containing the roots to create.
> > -        def yield_roots():
> > -            for key in root_id_order:
> > -                root_id, rev_id = key
> > -                rev_parents = parent_map[rev_id]
> > -                # We drop revision parents with different file-ids, because
> > -                # that represents a rename of the root to a different location
> > -                # - its not actually a parent for us. (We could look for that
> > -                # file id in the revision tree at considerably more expense,
> > -                # but for now this is sufficient (and reconcile will catch and
> > -                # correct this anyway).
> > -                # When a parent revision is a ghost, we guess that its root id
> > -                # was unchanged (rather than trimming it from the parent list).
> > -                parent_keys = tuple((root_id, parent) for parent in rev_parents
> > -                    if parent != NULL_REVISION and
> > -                        rev_id_to_root_id.get(parent, root_id) == root_id)
> > -                yield FulltextContentFactory(key, parent_keys, None, '')
> > -        return [('texts', yield_roots())]
> > +        from bzrlib.graph import FrozenHeadsCache
> > +        graph = FrozenHeadsCache(graph)
> > +        new_roots_stream = _new_root_data_stream(
> > +            root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
> > +        return [('texts', new_roots_stream)]
> > +
> 
> These functions have a lot of parameters. Perhaps that would work better
> as state on the Source?

Perhaps, although InterDifferingSerializer uses it too (that's where I
originally extracted this code from).  Something to look at again when we remove
IDS, I think.

> They need docs/more docs respectively.

The methods they were refactored from didn't have docs either :P

Docs extended.

> > === modified file 'bzrlib/help_topics/en/debug-flags.txt'
> > --- bzrlib/help_topics/en/debug-flags.txt       2009-07-24 03:15:56 +0000
> > +++ bzrlib/help_topics/en/debug-flags.txt       2009-08-05 02:05:43 +0000
> > @@ -12,6 +12,7 @@
> >                    operations.
> >  -Dfetch           Trace history copying between repositories.
> >  -Dfilters         Emit information for debugging content filtering.
> > +-Dforceinvdeltas  Force use of inventory deltas during generic streaming fetch.
> >  -Dgraph           Trace graph traversal.
> >  -Dhashcache       Log every time a working file is read to determine its hash.
> >  -Dhooks           Trace hook execution.
> > @@ -26,3 +27,7 @@
> >  -Dunlock          Some errors during unlock are treated as warnings.
> >  -Dpack            Emit information about pack operations.
> >  -Dsftp            Trace SFTP internals.
> > +-Dstream          Trace fetch streams.
> > +-DIDS:never       Never use InterDifferingSerializer when fetching.
> > +-DIDS:always      Always use InterDifferingSerializer to fetch if appropriate
> > +                  for the format, even for non-local fetches.
> 
> 
> I'm a bit uncomfortable with all these debug flags. Are they really
> needed? Do we anticipate asking people to use them? If not perhaps we
> shouldn't list them, or should use environment variables.

I hope we don't need to ask people to use them, but they are a cheap insurance
policy.

More usefully, they are helpful for benchmarking and testing.  I'm ok with
hiding them if you like.

> > === modified file 'bzrlib/inventory_delta.py'
> > --- bzrlib/inventory_delta.py   2009-04-02 05:53:12 +0000
> > +++ bzrlib/inventory_delta.py   2009-08-05 02:30:11 +0000
> 
> > 
> > -    def __init__(self, versioned_root, tree_references):
> > -        """Create an InventoryDeltaSerializer.
> > +    def __init__(self):
> > +        """Create an InventoryDeltaSerializer."""
> > +        self._versioned_root = None
> > +        self._tree_references = None
> > +        self._entry_to_content = {
> > +            'directory': _directory_content,
> > +            'file': _file_content,
> > +            'symlink': _link_content,
> > +        }
> > +
> > +    def require_flags(self, versioned_root=None, tree_references=None):
> > +        """Set the versioned_root and/or tree_references flags for this
> > +        (de)serializer.
> 
> ^ why is this not in the constructor? You make the fields settable only
> once, which seems identical to being set from __init__, but harder to
> use. As its required to be called, it should be documented in the class
> or __init__ docstring or something like that.

This is a small step towards a larger change that I don't want to tackle right
now.

We really ought to have separate classes for the serializer and for the
deserializer.  The serializer could indeed have these set at construction time,
I think.

For deserializing, we don't necessary know, or care, what the flags are; e.g. a
repo that supports rich-root and tree-refs can deserialise any delta.  It was
pretty ugly trying to cope with declaring this upfront in the code; hence this
method which defers it.

> > @@ -221,10 +255,18 @@
> >      def parse_text_bytes(self, bytes):
> >          """Parse the text bytes of a serialized inventory delta.
> >  
> > +        If versioned_root and/or tree_references flags were set via
> > +        require_flags, then the parsed flags must match or a BzrError will be
> > +        raised.
> > +
> ...
> 
> the changes to this function look like they do what they are intended to
> do, but I don't understand why thats desirable or necessary: we always
> know whether we want rich/nonrich etc.

But sometimes we want both (or rather, we don't have a preference).  There's no
good reason to reject a "tree_references: false" delta sent to a repo that
supports them.

> > === modified file 'bzrlib/remote.py'
[...]
> > +    @property
> > +    def repository_class(self):
> > +        self._ensure_real()
> > +        return self._custom_format.repository_class
> > +
> 
> ^ this property sets off alarm bells for me. What its for?

Hmm, good question... ah, I think it's for the to_format.repository_class in
CHKRepository._get_source:

def _get_source(self, to_format):
        """Return a source for streaming from this repository."""
        if (to_format.supports_chks and
            self._format.repository_class is to_format.repository_class and
            self._format._serializer == to_format._serializer):

If I take your suggestion later on to just compare serializers this would be
unnecessary I think.

> >  class RemoteStreamSink(repository.StreamSink):
> >  
> > +    def __init__(self, target_repo):
> > +        repository.StreamSink.__init__(self, target_repo)
> > +
> 
> ^ this isn't needed, as its doing nothing more than upcall.

Oops, thanks.  Debugging detritus.  Removed.

> >          path = target.bzrdir._path_for_remote_call(client)
> > -        if not resume_tokens:
> > -            # XXX: Ugly but important for correctness, *will* be fixed during
> > -            # 1.13 cycle. Pushing a stream that is interrupted results in a
> > -            # fallback to the _real_repositories sink *with a partial stream*.
> > -            # Thats bad because we insert less data than bzr expected. To avoid
> > -            # this we do a trial push to make sure the verb is accessible, and
> > -            # do not fallback when actually pushing the stream. A cleanup patch
> > -            # is going to look at rewinding/restarting the stream/partial
> > -            # buffering etc.
> 
> ^-- this isn't fixed, is it?

That comment was never accurate, we never did a fallback with a partial stream.

I've re-added an accurate comment:

# Probe for the verb to use with an empty stream before sending the
        # real stream to it.  We do this both to avoid the risk of sending a
        # large request that is then rejected, and because we don't want to
        # implement a way to buffer, rewind, or restart the stream.

> >  class RemoteStreamSource(repository.StreamSource):
> >      """Stream data from a remote server."""
> > @@ -1745,6 +1815,12 @@
> >              sources.append(repo)
> >          return self.missing_parents_chain(search, sources)
> >  
> > +    def get_stream_for_missing_keys(self, missing_keys):
> > +        self.from_repository._ensure_real()
> > +        real_repo = self.from_repository._real_repository
> > +        real_source = real_repo._get_source(self.to_format)
> > +        return real_source.get_stream_for_missing_keys(missing_keys)
> 
> ^ this fixes bug 406686. I don't think you mention that though inNEWS
> etc.

I don't think that bug report was filed when I wrote that :P

Added NEWS entry.

> > === modified file 'bzrlib/repofmt/groupcompress_repo.py'
[...]
> > +        self.new_pack.finish_content()
> > +        if len(self.packs) == 1:
> > +            old_pack = self.packs[0]
> > +             old_pack.name == self.new_pack._hash.hexdigest():
> > +                # The single old pack was already optimally packed.
> > +                self.new_pack.abort()
> > +                return None
> 
> ^ This isn't quite right. It solves the case of 'bzr pack; bzr
> pack' (and there is a bug open on that too). However it won't fix the
> case where a single new pack is streamed, cross format, and also happens
> to be optimally packed in the outcome. So, please file a bug on this
> additional case.

I don't understand why that case isn't covered.  Perhaps you should file the bug
instead of me?

> > @@ -879,14 +888,13 @@
> >  
> >      def _get_source(self, to_format):
> >          """Return a source for streaming from this repository."""
> > -        if to_format.__class__ is self._format.__class__:
> > +        if (to_format.supports_chks and
> > +            self._format.repository_class is to_format.repository_class and
> > +            self._format._serializer == to_format._serializer):
> 
> ^- this change is wrong; repo format classes have constant serializers -
> the check is overkill, and requiring a format to have a constant
> repository class, which if we'd needed we would have just passed that in
> instead. Or to say it differently, either pass the repo class in, or
> don't test this.
> 
> >              # We must be exactly the same format, otherwise stuff like the chk
> > -            # page layout might be different
> > +            # page layout might be different.
> > +            # Actually, this test is just slightly looser than exact so that
> > +            # CHK2 <-> 2a transfers will work.
> >              return GroupCHKStreamSource(self, to_format)
> 
> CHK2 <-> 2a have the same serializer. The test should just look at
> serializer I suspect.

Ok, I've deleted the repository_class part of the if.  Should I delete the
“to_format.supports_chk” part too?

> > === modified file 'bzrlib/repofmt/pack_repo.py'
> > --- bzrlib/repofmt/pack_repo.py 2009-07-01 10:42:14 +0000
> > +++ bzrlib/repofmt/pack_repo.py 2009-07-16 02:10:20 +0000
> 
> > @@ -1567,7 +1574,7 @@
> >          # determine which packs need changing
> >          pack_operations = [[0, []]]
> >          for pack in self.all_packs():
> > -            if not hint or pack.name in hint:
> > +            if hint is None or pack.name in hint:
> 
> ^- the original form is actually faster, AFAIK. because it skips the in
> test for a hint of []. I'd rather we didn't change it, for all that its
> not in a common code path.

But if hint is None, then “pack.name in hint” will fail.

> > @@ -2093,6 +2100,7 @@
> >                  # when autopack takes no steps, the names list is still
> >                  # unsaved.
> >                  return self._save_pack_names()
> > +        return []
> 
> ^- is this correctness or changing the signature? If its the latter,
> update the docstring perhaps?

Both, in a sense.  A clarification if you like... it found it useful to be able
to distinguish “hint was passed, but empty” and “no hint passed”.  Otherwise
passing an empty hint triggered a full pack!

Which docstring documents the return value of
commit_write_group/_commit_write_group, though?

I can add this to Repository.commit_write_group's docstring:

:return: it may return an opaque hint that can be passed to 'pack'.

But I think it's correct that the hint should be opaque at that level.  It seems
like a detail that belongs in the scope of pack_repo.py.  I guess
InterDifferingSerializer already assumes that the hint is a list...

Basically, where should I document this?

> > === modified file 'bzrlib/repository.py'
> > --- bzrlib/repository.py        2009-08-04 16:20:05 +0000
> > +++ bzrlib/repository.py        2009-08-05 02:37:11 +0000
> > @@ -924,6 +925,11 @@
> >          """
> >          if self._write_group is not self.get_transaction():
> >              # has an unlock or relock occured ?
> > +            if suppress_errors:
> > +                mutter(
> > +                '(suppressed) mismatched lock context and write group. %r, %r',
> > +                self._write_group, self.get_transaction())
> > +                return
> 
> ^- is suppress_errors defined globally?

It's an argument to abort_write_group.  It seemed a bit ridculous to have an
explicit “don't raise, just abort” flag and then first thing the function does
is possibly raise an error ignoring that flag!  It was masking a deeper failure
for me at one point.

> > +    def _get_convertable_inventory_stream(self, revision_ids,
> > +                                          delta_versus_null=False):
> > +        # The source is using CHKs, but the target either doesn't or is has a
> > +        # different serializer.  The StreamSink code expects to be able to
> > +        # convert on the target, so we need to put bytes-on-the-wire that can
> > +        # be converted.  That means inventory deltas (if the remote is <1.18,
> > +        # RemoteStreamSink will fallback to VFS to insert the deltas).
> 
> ^- broken english 'or is has' In fact, reread the entire comment ;).

s/is/it/, :)

> > === modified file 'bzrlib/smart/repository.py'
> > --- bzrlib/smart/repository.py  2009-06-16 06:46:32 +0000
> > +++ bzrlib/smart/repository.py  2009-08-04 00:51:24 +0000
> > @@ -414,8 +418,39 @@
> >              repository.
> >          """
> >          self._to_format = network_format_registry.get(to_network_name)
> > +        if self._should_fake_unknown():
> > +            return FailedSmartServerResponse(
> > +                ('UnknownMethod', 'Repository.get_stream'))
> >          return None # Signal that we want a body.
> >  
> > +    def _should_fake_unknown(self):
> 
> ^- the long comment here could well be helped by a docstring. I think
> the method name could be improved too - perhaps self._support_method(),
> or self._permit_method(). Returning UnknownMethod is a special way to
> support it after all.

Well, UnknownMethod is precisely for “this server says I do not have this
verb”, so I'm not sure in what sense you can say returning it is a way to
support this verb.

I'm not sure how a docstring would make this clearer.  Specific suggestions
welcome.

> > +class SmartServerRepositoryInsertStream_1_18(SmartServerRepositoryInsertStreamLocked):
> > +    """Insert a record stream from a RemoteSink into a repository.
> > +
> > +    Same as SmartServerRepositoryInsertStreamLocked, except:
> > +     - the lock token argument is optional
> > +     - servers that implement this verb accept 'inventory-delta' records in the
> > +       stream.
> > +
> > +    New in 1.18.
> > +    """
> > +
> > +    def do_repository_request(self, repository, resume_tokens, lock_token=None):
> > +        """StreamSink.insert_stream for a remote repository."""
> > +        SmartServerRepositoryInsertStreamLocked.do_repository_request(
> > +            self, repository, resume_tokens, lock_token)
> 
> ^ this is really odd -- you upcall, unconditionally, and do nothing
> else. Is the reason to just make lock_token optional? If so, why not
> just change the signature on
> SmartServerRepositoryInsertStreamLocked.do_repository_request? Secondly,
> the base class will accept deltas too, AFAICT. So really,its that
> calling the new registered verb is a way to be sure that deltas will be
> accepted. I'd be happier with just:
> register SSRISL twice in the registry, and document the
> improvements/easements. Note that we will/should be locking always (at
> the api layer, though not necessarily the network) for compatibility
> with knits - its why we have SSRISLocked anyway. So the change to make
> lock_token optional seems unneeded.

I want to avoid incidental changes to old verbs to minimise the risk of
accidentally introducing cross-version imcompatibilities.  It would be easy for
someone to glance at the definition of a verb in 1.18 and see that it can take
an optional arg, and then assume that therefore it always took that arg, and
write client code accordingly.

I needed a new verb registration, obviously, and I took the opportunity to make
the locking optional, which with hindsight is how the interface should have been
in the first place.  (We didn't add the _locked variant until a release after
the original, when we realised we needed that capability and our original verb
didn't have it.)

So, I'm comfortable that this is
  a) the right interface (just one verb), and
  b) being appropriately safe about backwards compatibility (if erring slightly
     on the side of paranoia).

To be really paranoid, I *am* actually tempted to explicitly guard against
inventory-deltas in the old verb, although I'm not sure that actually buys us
much for the effort involved.

> > === modified file 'bzrlib/tests/__init__.py'
> > --- bzrlib/tests/__init__.py    2009-08-04 11:40:59 +0000
> > +++ bzrlib/tests/__init__.py    2009-08-05 02:37:11 +0000
> > @@ -1938,6 +1938,16 @@
> >          sio.encoding = output_encoding
> >          return sio
> >  
> > +    def disable_verb(self, verb):
> > +        """Disable a smart server verb for one test."""
> > +        from bzrlib.smart import request
> > +        request_handlers = request.request_handlers
> > +        orig_method = request_handlers.get(verb)
> > +        request_handlers.remove(verb)
> > +        def restoreVerb():
> > +            request_handlers.register(verb, orig_method)
> > +        self.addCleanup(restoreVerb)
> 
> [aside] It would be nice to find a better management pattern for these helpers.

[agreed]

> The blackbox and unit test_push changes are identical; may be ripe for
> refactoring to reduce duplication there.

Yeah, I thought that too... not something I'm going to do right now, and I'm not
sure that it would make a good bug report.

> > === modified file 'bzrlib/tests/per_interrepository/__init__.py'
[...]
> > -    for interrepo_class, repository_format, repository_format_to in formats:
> > -        id = '%s,%s,%s' % (interrepo_class.__name__,
> > -                            repository_format.__class__.__name__,
> > -                            repository_format_to.__class__.__name__)
> > +    for repository_format, repository_format_to in formats:
> > +        id = '%s,%s' % (repository_format.__class__.__name__,
> > +                        repository_format_to.__class__.__name__)
> 
> ^- This means that we can't specify a specific interrepo type in test
> selection. I'm strongly against losing that facility.
[...]
> And I share John's concern about losing coverage with the above. We were
> testing the same helper with different types before, and we're not able
> to do that now. This doesn't seem coupled to your actual intent, perhaps
> it would be better to just backout this change.

See my reply to John.  bzr.dev doesn't have this facility, it just misleadingly
looks like it does.  Nothing was actually using interrepo_class.  I removed the
confusion.

I considered leaving it there simply for test selection, but that's not the
right way to do that, because it can be (and was!) out of sync with what the
test was actually doing.

> > +        if not unstacked_repo._format.supports_chks:
> > +            # these assertions aren't valid for groupcompress repos, which may
> > +            # transfer data than strictly necessary to avoid breaking up an
> > +            # already-compressed block of data.
> > +            self.assertFalse(unstacked_repo.has_revision('left'))
> > +            self.assertFalse(unstacked_repo.has_revision('right'))
> 
> ^ please check the comment, its not quite clear.

s/transfer data/transfer more data/

I'm not 100% certain that's actually true?  What do you think with the adjusted
comment?

> 
> > +    def assertCanStreamRevision(self, repo, revision_id):
> > +        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
> > +        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
> > +        source = repo._get_source(repo._format)
> > +        for substream_kind, substream in source.get_stream(search):
> > +            # Consume the substream
> > +            list(substream)

> ^ - this assertion, while named nice, isn't strong enough for me. I'd
> like it to assert that either a delta matching the local changes, or a
> full inventory, are streamed - and that the newly referenced file texts
> are present.

Hmm, fair enough.  That may happen after landing, though, but not long after.

> > === modified file 'bzrlib/tests/test_inventory_delta.py'
> > --- bzrlib/tests/test_inventory_delta.py        2009-04-02 05:53:12 +0000
> > +++ bzrlib/tests/test_inventory_delta.py        2009-08-05 02:30:11 +0000
> > ...
> 
> The changes in this test script match what the inventory_delta change -
> but as I'm not convinced by that change they feel somewhat awkard - and
> incomplete, as there are more dimensions for the object to vary in now.

It is a bit awkward, I think splitting the serializer and deserializer would
help that.

> > === modified file 'bzrlib/tests/test_remote.py'
> > --- bzrlib/tests/test_remote.py 2009-07-30 04:27:05 +0000
> > +++ bzrlib/tests/test_remote.py 2009-08-05 01:41:03 +0000
> > 
> > +    """Tests for using Repository.insert_stream verb when the _1.18 variant is
> > +    not available.
> > +
> > +    This test case is very similar to TestRepositoryInsertStream_1_18.
> > +    """
> 
> Perhaps a base class would make this smaller - it is nearly identical.

I don't see any elegant ways to cut the duplication at this point.  I haven't
tried very hard though, because it's just small enough that it fits under my
threshold for tolerating the ugly...

> > +    def test_stream_with_inventory_deltas(self):
> > +        """'inventory-deltas' substreams can't be sent to the
> > +        Repository.insert_stream verb.  So when one is encountered the
> > +        RemoteSink immediately stops using that verb and falls back to VFS
> > +        insert_stream.
> > +        """
> 
> ^this comment isn't strictly true. We *can* send them, to servers
> running 1.18. But not all servers will *accept* them, and thats why we
> want to be sure we *won't* send them. I appreciate this is nitpicking,
> but we're encoding knowledge about the system for later developers, and
> this is worth being clear on, I think.

Fair enough.  Clarified:

"""'inventory-deltas' substreams cannot be sent to the
        Repository.insert_stream verb, because not all servers that implement
        that verb will accept them.  So when one is encountered the RemoteSink
        immediately stops using that verb and falls back to VFS insert_stream.
        """

> I"m not sure how to say 'needs fixing' in review mail, but thats what
> I'd say.
> 
> Its close though.

Thanks very much for the extensive review.

-Andrew.

Revision history for this message

Robert Collins (lifeless) wrote on 2009-08-07:

Download full text (15.2 KiB)

On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:

I've trimmed things that are fine.

[debug flags]
> I hope we don't need to ask people to use them, but they are a cheap insurance
> policy.
>
> More usefully, they are helpful for benchmarking and testing. I'm ok with
> hiding them if you like.

I don't have a strong opinion. The flags are in our docs though as it
stands, so they should be clear to people reading them rather than
perhaps causing confusion.

> > > === modified file 'bzrlib/inventory_delta.py'
> > > --- bzrlib/inventory_delta.py 2009-04-02 05:53:12 +0000
> > > +++ bzrlib/inventory_delta.py 2009-08-05 02:30:11 +0000
> >
> > >
> > > - def __init__(self, versioned_root, tree_references):
> > > - """Create an InventoryDeltaSerializer.
> > > + def __init__(self):
> > > + """Create an InventoryDeltaSerializer."""
> > > + self._versioned_root = None
> > > + self._tree_references = None
> > > + self._entry_to_content = {
> > > + 'directory': _directory_content,
> > > + 'file': _file_content,
> > > + 'symlink': _link_content,
> > > + }
> > > +
> > > + def require_flags(self, versioned_root=None, tree_references=None):
> > > + """Set the versioned_root and/or tree_references flags for this
> > > + (de)serializer.
> >
> > ^ why is this not in the constructor? You make the fields settable only
> > once, which seems identical to being set from __init__, but harder to
> > use. As its required to be called, it should be documented in the class
> > or __init__ docstring or something like that.
>
> This is a small step towards a larger change that I don't want to tackle right
> now.
>
> We really ought to have separate classes for the serializer and for the
> deserializer. The serializer could indeed have these set at construction time,
> I think.
>
> For deserializing, we don't necessary know, or care, what the flags are; e.g. a
> repo that supports rich-root and tree-refs can deserialise any delta. It was
> pretty ugly trying to cope with declaring this upfront in the code; hence this
> method which defers it.

In principle yes, but our code always knows the source serializer,
doesn't it?

Anyhow, at the moment it seems unclear and confusing, rather than
clearer. I would find it clearer as either being on the __init__, I
think, or perhaps with two seperate constructors? No biggy, not enough
to block the patch on, but it is a very awkward halfway house at the
moment - and I wouldn't want to see it left that way - so I'm concerned
that you might switch focus away from this straight after landing it.

On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:

I've trimmed things that are fine.

[debug flags] 
> I hope we don't need to ask people to use them, but they are a cheap insurance
> policy.
> 
> More usefully, they are helpful for benchmarking and testing.  I'm ok with
> hiding them if you like.

I don't have a strong opinion. The flags are in our docs though as it
stands, so they should be clear to people reading them rather than
perhaps causing confusion.

> > > === modified file 'bzrlib/inventory_delta.py'
> > > --- bzrlib/inventory_delta.py   2009-04-02 05:53:12 +0000
> > > +++ bzrlib/inventory_delta.py   2009-08-05 02:30:11 +0000
> > 
> > > 
> > > -    def __init__(self, versioned_root, tree_references):
> > > -        """Create an InventoryDeltaSerializer.
> > > +    def __init__(self):
> > > +        """Create an InventoryDeltaSerializer."""
> > > +        self._versioned_root = None
> > > +        self._tree_references = None
> > > +        self._entry_to_content = {
> > > +            'directory': _directory_content,
> > > +            'file': _file_content,
> > > +            'symlink': _link_content,
> > > +        }
> > > +
> > > +    def require_flags(self, versioned_root=None, tree_references=None):
> > > +        """Set the versioned_root and/or tree_references flags for this
> > > +        (de)serializer.
> > 
> > ^ why is this not in the constructor? You make the fields settable only
> > once, which seems identical to being set from __init__, but harder to
> > use. As its required to be called, it should be documented in the class
> > or __init__ docstring or something like that.
> 
> This is a small step towards a larger change that I don't want to tackle right
> now.
> 
> We really ought to have separate classes for the serializer and for the
> deserializer.  The serializer could indeed have these set at construction time,
> I think.
> 
> For deserializing, we don't necessary know, or care, what the flags are; e.g. a
> repo that supports rich-root and tree-refs can deserialise any delta.  It was
> pretty ugly trying to cope with declaring this upfront in the code; hence this
> method which defers it.

In principle yes, but our code always knows the source serializer,
doesn't it?

> > > === modified file 'bzrlib/remote.py'
> [...]
> > > +    @property
> > > +    def repository_class(self):
> > > +        self._ensure_real()
> > > +        return self._custom_format.repository_class
> > > +
> > 
> > ^ this property sets off alarm bells for me. What its for?
> 
> Hmm, good question... ah, I think it's for the to_format.repository_class in
> CHKRepository._get_source:
> 
>     def _get_source(self, to_format):
>         """Return a source for streaming from this repository."""
>         if (to_format.supports_chks and
>             self._format.repository_class is to_format.repository_class and
>             self._format._serializer == to_format._serializer):
> 
> If I take your suggestion later on to just compare serializers this would be
> unnecessary I think.

Yes - please remove it then ;).

> That comment was never accurate, we never did a fallback with a partial stream.
> 
> I've re-added an accurate comment:
> 
>         # Probe for the verb to use with an empty stream before sending the
>         # real stream to it.  We do this both to avoid the risk of sending a
>         # large request that is then rejected, and because we don't want to
>         # implement a way to buffer, rewind, or restart the stream.

Danke.

> > > === modified file 'bzrlib/repofmt/groupcompress_repo.py'
> [...]
> > > +        self.new_pack.finish_content()
> > > +        if len(self.packs) == 1:
> > > +            old_pack = self.packs[0]
> > > +             old_pack.name == self.new_pack._hash.hexdigest():
> > > +                # The single old pack was already optimally packed.
> > > +                self.new_pack.abort()
> > > +                return None
> > 
> > ^ This isn't quite right. It solves the case of 'bzr pack; bzr
> > pack' (and there is a bug open on that too). However it won't fix the
> > case where a single new pack is streamed, cross format, and also happens
> > to be optimally packed in the outcome. So, please file a bug on this
> > additional case.
> 
> I don't understand why that case isn't covered.  Perhaps you should file the bug
> instead of me?

I'd like to make sure I explain it well enough first; once you
understand either of us can file it - and I'll be happy enough to be
that person.

You've added code that says
'if we have one pack, and we're adding a pack with the same hash while
doing pack(), then we were optimally packed'.
This fixes a bug in 2a 'bzr pack; bzr pack -> boom' (please do look that
bug up and close it in NEWS.

However, because of hints there is another case:
'if we have N packs, and we pack() with a hint of [one_pack], the new
pack we create could have the same hash as one_pack, because one_pack
happens to be optimally packed'.

This will show up in exactly the same manner as what you're hitting
today, on incremental pushes. (Just pushing two one-file revisions, with
a trivial change between them, in two pushes, should trigger this bug).

I think this should be fixed before landing this patch, as its pretty
tightly linked to the conversion logic for this to happen.

What you need to do is change the test from len(self.packs) == 1, to
len(packs being combined) == 1 and that_pack has the same hash.

> > > @@ -879,14 +888,13 @@

> Ok, I've deleted the repository_class part of the if.  Should I delete the
> “to_format.supports_chk” part too?

yes. self is chk_supporting, so self._serializer ==
to_format.serializer, or whatever, is fine.

> > > === modified file 'bzrlib/repofmt/pack_repo.py'
> > > --- bzrlib/repofmt/pack_repo.py 2009-07-01 10:42:14 +0000
> > > +++ bzrlib/repofmt/pack_repo.py 2009-07-16 02:10:20 +0000
> > 
> > > @@ -1567,7 +1574,7 @@
> > >          # determine which packs need changing
> > >          pack_operations = [[0, []]]
> > >          for pack in self.all_packs():
> > > -            if not hint or pack.name in hint:
> > > +            if hint is None or pack.name in hint:
> > 
> > ^- the original form is actually faster, AFAIK. because it skips the in
> > test for a hint of []. I'd rather we didn't change it, for all that its
> > not in a common code path.
> 
> But if hint is None, then “pack.name in hint” will fail.

It would, but it won't execute. As we discussed, if you could change it
back, but add a comment - its clearly worthy of expanding on (in either
form the necessity for testing hint isn't obvious, apparently).

> > > @@ -2093,6 +2100,7 @@
> > >                  # when autopack takes no steps, the names list is still
> > >                  # unsaved.
> > >                  return self._save_pack_names()
> > > +        return []
> > 
> > ^- is this correctness or changing the signature? If its the latter,
> > update the docstring perhaps?
> 
> Both, in a sense.  A clarification if you like... it found it useful to be able
> to distinguish “hint was passed, but empty” and “no hint passed”.  Otherwise
> passing an empty hint triggered a full pack!
> 
> Which docstring documents the return value of
> commit_write_group/_commit_write_group, though?
> 
> I can add this to Repository.commit_write_group's docstring:
> 
>         :return: it may return an opaque hint that can be passed to 'pack'.
> 
> But I think it's correct that the hint should be opaque at that level.  It seems
> like a detail that belongs in the scope of pack_repo.py.  I guess
> InterDifferingSerializer already assumes that the hint is a list...
> 
> Basically, where should I document this?

Repository.commit_write_group sounds fine. And yes, opaque to the repo
is correct.

> > > === modified file 'bzrlib/smart/repository.py'
> > > --- bzrlib/smart/repository.py  2009-06-16 06:46:32 +0000
> > > +++ bzrlib/smart/repository.py  2009-08-04 00:51:24 +0000
> > > @@ -414,8 +418,39 @@
> > >              repository.
> > >          """
> > >          self._to_format = network_format_registry.get(to_network_name)
> > > +        if self._should_fake_unknown():
> > > +            return FailedSmartServerResponse(
> > > +                ('UnknownMethod', 'Repository.get_stream'))
> > >          return None # Signal that we want a body.
> > >  
> > > +    def _should_fake_unknown(self):
> > 
> > ^- the long comment here could well be helped by a docstring. I think
> > the method name could be improved too - perhaps self._support_method(),
> > or self._permit_method(). Returning UnknownMethod is a special way to
> > support it after all.
> 
> Well, UnknownMethod is precisely for “this server says I do not have this
> verb”, so I'm not sure in what sense you can say returning it is a way to
> support this verb.
> 
> I'm not sure how a docstring would make this clearer.  Specific suggestions
> welcome.

As we discussed, I think the method should say what it does; the fact
that the base class encodes more complex policy than a child is a smell
(not one that you need to fix), but it would be very nice to have pydoc
help make it more obvious that the base class does this special
handling.

> > > +class SmartServerRepositoryInsertStream_1_18(SmartServerRepositoryInsertStreamLocked):
> > > +    """Insert a record stream from a RemoteSink into a repository.
> > > +
> > > +    Same as SmartServerRepositoryInsertStreamLocked, except:
> > > +     - the lock token argument is optional
> > > +     - servers that implement this verb accept 'inventory-delta' records in the
> > > +       stream.
> > > +
> > > +    New in 1.18.
> > > +    """
> > > +
> > > +    def do_repository_request(self, repository, resume_tokens, lock_token=None):
> > > +        """StreamSink.insert_stream for a remote repository."""
> > > +        SmartServerRepositoryInsertStreamLocked.do_repository_request(
> > > +            self, repository, resume_tokens, lock_token)
> > 
> > ^ this is really odd -- you upcall, unconditionally, and do nothing
> > else. Is the reason to just make lock_token optional? If so, why not
> > just change the signature on
> > SmartServerRepositoryInsertStreamLocked.do_repository_request? Secondly,
> > the base class will accept deltas too, AFAICT. So really,its that
> > calling the new registered verb is a way to be sure that deltas will be
> > accepted. I'd be happier with just:
> > register SSRISL twice in the registry, and document the
> > improvements/easements. Note that we will/should be locking always (at
> > the api layer, though not necessarily the network) for compatibility
> > with knits - its why we have SSRISLocked anyway. So the change to make
> > lock_token optional seems unneeded.
> 
> I want to avoid incidental changes to old verbs to minimise the risk of
> accidentally introducing cross-version imcompatibilities.  It would be easy for
> someone to glance at the definition of a verb in 1.18 and see that it can take
> an optional arg, and then assume that therefore it always took that arg, and
> write client code accordingly.
> 
> I needed a new verb registration, obviously, and I took the opportunity to make
> the locking optional, which with hindsight is how the interface should have been
> in the first place.  (We didn't add the _locked variant until a release after
> the original, when we realised we needed that capability and our original verb
> didn't have it.)
> 
> So, I'm comfortable that this is
>   a) the right interface (just one verb), and
>   b) being appropriately safe about backwards compatibility (if erring slightly
>      on the side of paranoia).
> 
> To be really paranoid, I *am* actually tempted to explicitly guard against
> inventory-deltas in the old verb, although I'm not sure that actually buys us
> much for the effort involved.

So, if the lock_token was required, there would be:
 - just one verb
 - no backwards compatibility issues

Up to you; I do think less code is better, all things considered.

> I considered leaving it there simply for test selection, but that's not the
> right way to do that, because it can be (and was!) out of sync with what the
> test was actually doing.

You proposed putting class names in literally, that would be fine by me.
Its no less likely to skew, but I think the issue goes deeper than 'it
can skew' - we have many potentially expensive tests being executed in
the same situation, AIUI. So we may well want to revisit this.

> > > +        if not unstacked_repo._format.supports_chks:
> > > +            # these assertions aren't valid for groupcompress repos, which may
> > > +            # transfer data than strictly necessary to avoid breaking up an
> > > +            # already-compressed block of data.
> > > +            self.assertFalse(unstacked_repo.has_revision('left'))
> > > +            self.assertFalse(unstacked_repo.has_revision('right'))
> > 
> > ^ please check the comment, its not quite clear.
> 
> s/transfer data/transfer more data/
> 
> I'm not 100% certain that's actually true?  What do you think with the adjusted
> comment?

Big alarm bell. gc repos aren't permitted to do that for the revision
objects, because of our invariants about revisions. Something else is
wrong - that if _must not_ be needed.

> > 
> > > +    def assertCanStreamRevision(self, repo, revision_id):
> > > +        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
> > > +        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
> > > +        source = repo._get_source(repo._format)
> > > +        for substream_kind, substream in source.get_stream(search):
> > > +            # Consume the substream
> > > +            list(substream)
> 
> > ^ - this assertion, while named nice, isn't strong enough for me. I'd
> > like it to assert that either a delta matching the local changes, or a
> > full inventory, are streamed - and that the newly referenced file texts
> > are present.
> 
> Hmm, fair enough.  That may happen after landing, though, but not long after.

If this is opening a whole, I'd really rather it be done before landing;
especially if you agree its worth doing. If its an existing whole that
refactoring made obvious, then leaving it till later is fine with me.

> > > === modified file 'bzrlib/tests/test_remote.py'
> > > --- bzrlib/tests/test_remote.py 2009-07-30 04:27:05 +0000
> > > +++ bzrlib/tests/test_remote.py 2009-08-05 01:41:03 +0000
> > > 
> > > +    """Tests for using Repository.insert_stream verb when the _1.18 variant is
> > > +    not available.
> > > +
> > > +    This test case is very similar to TestRepositoryInsertStream_1_18.
> > > +    """
> > 
> > Perhaps a base class would make this smaller - it is nearly identical.
> 
> I don't see any elegant ways to cut the duplication at this point.  I haven't
> tried very hard though, because it's just small enough that it fits under my
> threshold for tolerating the ugly...

It went past mine :). TemplateMethod perhaps?

> > I"m not sure how to say 'needs fixing' in review mail, but thats what
> > I'd say.
> > 
> > Its close though.
> 
> Thanks very much for the extensive review.

My pleasure. It will be great to see this land.

-Rob

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-07:

[not a complete reply, just for some points that have been dealt with]

Robert Collins wrote:
[...]
> > Ok, I've deleted the repository_class part of the if. Should I delete the
> > “to_format.supports_chk” part too?
>
> yes. self is chk_supporting, so self._serializer ==
> to_format.serializer, or whatever, is fine.

Done.

> > > > + if not unstacked_repo._format.supports_chks:
> > > > + # these assertions aren't valid for groupcompress repos, which may
> > > > + # transfer data than strictly necessary to avoid breaking up an
> > > > + # already-compressed block of data.
> > > > + self.assertFalse(unstacked_repo.has_revision('left'))
> > > > + self.assertFalse(unstacked_repo.has_revision('right'))
> > >
> > > ^ please check the comment, its not quite clear.
> >
> > s/transfer data/transfer more data/
> >
> > I'm not 100% certain that's actually true? What do you think with the adjusted
> > comment?
>
> Big alarm bell. gc repos aren't permitted to do that for the revision
> objects, because of our invariants about revisions. Something else is
> wrong - that if _must not_ be needed.

That if guards turns out to be unnecessary; that test passes for all scenarios
with those assertions run unconditionally. Whatever bug I had that prompted me
to add it is clearly gone.

-Andrew.

Revision history for this message

Martin Pool (mbp) wrote on 2009-08-07:

2009/8/7 Robert Collins <email address hidden>:
> On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:
>
> I've trimmed things that are fine.
>
> [debug flags]
>> I hope we don't need to ask people to use them, but they are a cheap insurance
>> policy.
>>
>> More usefully, they are helpful for benchmarking and testing. I'm ok with
>> hiding them if you like.
>
> I don't have a strong opinion. The flags are in our docs though as it
> stands, so they should be clear to people reading them rather than
> perhaps causing confusion.

I don't want the merge to get hung up on them, but I think they're
worth putting in.

It might get confusing if there are lots of flags mentioned in the
user documentation; also we may want to distinguish those ones that
change behaviour from those that are safe to leave on all the time to
gather data.

Andrew and I talked about this the other day and the reasoning was
this: we observed he's testing and comparing this by commenting out
some code. If someone's doing that (and not just for a quick
ten-minute comparison), it _may_ be worth leaving a debug option to do
it in future, so that

1- they don't accidentally leave it disabled (as has sometimes happened before)
2- other people can quickly repeat the same test with just the same change
3- if it turns out that the change does have performance or
functionality issues, users can try it with the other behaviour

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-07:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Thu, 2009-08-06 at 16:15 +0000, John A Meinel wrote:
>> if utf8_path.startswith('/'):
>>
>> ^- If this is a core routine (something called for every path) then:
>> if utf8_path[:1] == '/':
>>
>> *is* faster than .startswith() because you
>> 1) Don't have a function call
>> 2) Don't have an attribute lookup
>>
>> I'm assuming this is a function that gets called a lot. If not, don't
>> worry about it.
>
> utf8_path[:1] == '/' requires a string copy though, for all that its
> heavily tuned in the VM.

Well, a single character string is a singleton so doesn't actually
require a malloc. Certainly that is a VM implementation, but still, time
it yourself.

A function call plus attribute lookup is *much* slower than mallocing a
string (by my testing).

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkp8MtUACgkQJdeBCYSNAAM+sgCfWj/3SNNI7Adqs8hYI7vLeSJs
kLcAn3kE/ym/HYIn9KgM7b6TQH0+4uSq
=NhGl
-----END PGP SIGNATURE-----

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-07:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>>>> + if not unstacked_repo._format.supports_chks:
>>>> + # these assertions aren't valid for groupcompress repos, which may
>>>> + # transfer data than strictly necessary to avoid breaking up an
>>>> + # already-compressed block of data.
>>>> + self.assertFalse(unstacked_repo.has_revision('left'))
>>>> + self.assertFalse(unstacked_repo.has_revision('right'))
>>> ^ please check the comment, its not quite clear.
>> s/transfer data/transfer more data/
>>
>> I'm not 100% certain that's actually true? What do you think with the adjusted
>> comment?
>
> Big alarm bell. gc repos aren't permitted to do that for the revision
> objects, because of our invariants about revisions. Something else is
> wrong - that if _must not_ be needed.

Actually gc repos may send more data in a group, but they don't
*reference* the data. So the comment is actually incorrect. They don't
reference any more data than the strictly necessary set, and something
like "has_revision" should *definitely* be failing.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkp8NvQACgkQJdeBCYSNAAP4YgCcCmrttXg3iIomKS+JEsm5S6ih
pUMAoKpxevabz88K2yG5ZTx1zhB0s2Bc
=Yx5Q
-----END PGP SIGNATURE-----

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-08-13:

Download full text (8.7 KiB)

Robert Collins wrote:
> On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:
[...]
> [debug flags]
> > I hope we don't need to ask people to use them, but they are a cheap insurance
> > policy.
> >
> > More usefully, they are helpful for benchmarking and testing. I'm ok with
> > hiding them if you like.
>
> I don't have a strong opinion. The flags are in our docs though as it
> stands, so they should be clear to people reading them rather than
> perhaps causing confusion.

I'll leave them in, assuming I can find a way to make the ReST markup cope with
them...

[inventory_delta.py API]
> Anyhow, at the moment it seems unclear and confusing, rather than
> clearer. I would find it clearer as either being on the __init__, I
> think, or perhaps with two seperate constructors? No biggy, not enough
> to block the patch on, but it is a very awkward halfway house at the
> moment - and I wouldn't want to see it left that way - so I'm concerned
> that you might switch focus away from this straight after landing it.

I've now split the class into separate serializer and deserializer, which gives
“two separate constructors”. I think it's an improvement, I hope you'll think
so too!

[repacking single pack bug]
> > I don't understand why that case isn't covered. Perhaps you should file the bug
> > instead of me?
>
> I'd like to make sure I explain it well enough first; once you
> understand either of us can file it - and I'll be happy enough to be
> that person.
[...]
> What you need to do is change the test from len(self.packs) == 1, to
> len(packs being combined) == 1 and that_pack has the same hash.

But AIUI “self.packs” on a Packer *is* “packs being combined”. If it's not then
your explanation makes sense, but my reading of the code says otherwise.

[...]
> > > > - if not hint or pack.name in hint:
> > > > + if hint is None or pack.name in hint:
> > >
> > > ^- the original form is actually faster, AFAIK. because it skips the in
> > > test for a hint of []. I'd rather we didn't change it, for all that its
> > > not in a common code path.
> >
> > But if hint is None, then “pack.name in hint” will fail.
>
> It would, but it won't execute. As we discussed, if you could change it
> back, but add a comment - its clearly worthy of expanding on (in either
> form the necessity for testing hint isn't obvious, apparently).

Ok. (I still find it cleaner to have the hint is None case handled distinctly
from the hint is a list case, it seems to express the intent more clearly to me,
rather than making it appear to simply be a micro-optimisation. But I
understand that tastes vary.)

Oh, actually, it does matter. When hint == [], no packs should be repacked.
When hint is None, all packs should be repacked.

I've added this comment:

# Either no hint was provided (so we are packing everything),
# or this pack was included in the hint.

[...]
> > > > + def _should_fake_unknown(self):
[...]
>
> As we discussed, I think the method should say what it does; the fact
> that the base class encodes more complex policy than a child is a smell
> (not one that you need to fix), but it would be very nice...

Robert Collins wrote:
> On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:
[...]
> [debug flags] 
> > I hope we don't need to ask people to use them, but they are a cheap insurance
> > policy.
> > 
> > More usefully, they are helpful for benchmarking and testing.  I'm ok with
> > hiding them if you like.
> 
> I don't have a strong opinion. The flags are in our docs though as it
> stands, so they should be clear to people reading them rather than
> perhaps causing confusion.

I'll leave them in, assuming I can find a way to make the ReST markup cope with
them...

I've now split the class into separate serializer and deserializer, which gives
“two separate constructors”.  I think it's an improvement, I hope you'll think
so too!

[repacking single pack bug]
> > I don't understand why that case isn't covered.  Perhaps you should file the bug
> > instead of me?
> 
> I'd like to make sure I explain it well enough first; once you
> understand either of us can file it - and I'll be happy enough to be
> that person. 
[...]
> What you need to do is change the test from len(self.packs) == 1, to
> len(packs being combined) == 1 and that_pack has the same hash.

But AIUI “self.packs” on a Packer *is* “packs being combined”.  If it's not then
your explanation makes sense, but my reading of the code says otherwise.

[...]
> > > > -            if not hint or pack.name in hint:
> > > > +            if hint is None or pack.name in hint:
> > > 
> > > ^- the original form is actually faster, AFAIK. because it skips the in
> > > test for a hint of []. I'd rather we didn't change it, for all that its
> > > not in a common code path.
> > 
> > But if hint is None, then “pack.name in hint” will fail.
> 
> It would, but it won't execute. As we discussed, if you could change it
> back, but add a comment - its clearly worthy of expanding on (in either
> form the necessity for testing hint isn't obvious, apparently).

Ok.  (I still find it cleaner to have the hint is None case handled distinctly
from the hint is a list case, it seems to express the intent more clearly to me,
rather than making it appear to simply be a micro-optimisation.  But I
understand that tastes vary.)

Oh, actually, it does matter.  When hint == [], no packs should be repacked.
When hint is None, all packs should be repacked.

I've added this comment:

# Either no hint was provided (so we are packing everything),
                # or this pack was included in the hint.

[...]
> > > > +    def _should_fake_unknown(self):
[...]
> 
> As we discussed, I think the method should say what it does; the fact
> that the base class encodes more complex policy than a child is a smell
> (not one that you need to fix), but it would be very nice to have pydoc
> help make it more obvious that the base class does this special
> handling.

Made a docstring.

> > > > +class SmartServerRepositoryInsertStream_1_18(SmartServerRepositoryInsertStreamLocked):
[...]
> 
> So, if the lock_token was required, there would be:
>  - just one verb
>  - no backwards compatibility issues
> 
> Up to you; I do think less code is better, all things considered.

I'll leave it as is.  I don't think the costs of doing it this way are
insignificant, and it doesn't lock us into assuming future clients/formats must
be repos in advance before inserting a stream.

[interrepo tests]
> > I considered leaving it there simply for test selection, but that's not the
> > right way to do that, because it can be (and was!) out of sync with what the
> > test was actually doing.
> 
> You proposed putting class names in literally, that would be fine by me.
> Its no less likely to skew, but I think the issue goes deeper than 'it
> can skew' - we have many potentially expensive tests being executed in
> the same situation, AIUI. So we may well want to revisit this.

I've put the class names in literally.  I agree that the test structure here
doesn't seem optimal, and is probably worth revisiting some day.

> > > > +    def assertCanStreamRevision(self, repo, revision_id):
> > > > +        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
> > > > +        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
> > > > +        source = repo._get_source(repo._format)
> > > > +        for substream_kind, substream in source.get_stream(search):
> > > > +            # Consume the substream
> > > > +            list(substream)
> > 
> > > ^ - this assertion, while named nice, isn't strong enough for me. I'd
> > > like it to assert that either a delta matching the local changes, or a
> > > full inventory, are streamed - and that the newly referenced file texts
> > > are present.
> > 
> > Hmm, fair enough.  That may happen after landing, though, but not long after.
> 
> If this is opening a whole, I'd really rather it be done before landing;
> especially if you agree its worth doing. If its an existing whole that
> refactoring made obvious, then leaving it till later is fine with me.

Well, before it was only checking for some has_revisions and
.inventories.keys(), which turned out to be insufficient to notice some bugs.
e.g. not copying all the chk_bytes for the inventories.  This assertion caught
those.  So I think it is an incremental improvement, not opening a hole.

I think John has made some overlapping changes in bzr.dev based on the bugs I
noticed; I'll make sure this change fits his.  Oh ho!  Merging bzr.dev gets some
failures from assertCanStreamRevision that just the assertions in bzr.dev don't.
Hmm... oh!  It appears that when insert_stream tries a pack, and it turns out
the single pack is already optimal, it was accidentally causing that pack to be
removed from the pack-names list!  (Because it removes the unneeded pack, but
it's unneeded because it has the same name as the original pack, so it forgets
the original pack...).  Ok, I've fixed that now, but again it's something this
assertion catches that none of the assertions in this test in bzr.dev did :)

And for now I've kept all the assertions in these tests from bzr.dev, so the
changes here are strictly increasing coverage, so definitely no hole added.

A tangent: when we transfer parent inventories to a stacked repo, should we
transfer the texts introduced by those inventories or not?  I'm seeing a
worrisome half-and-half behaviour from the streaming code, where the synthesised
rich root, but no other texts, are transferred for inventories requested via
get_stream_for_missing_keys.  I think the intention is not (if those parent
revisions are ever filled in later, then at that time the corresponding texts
would be backfilled too).  I've now fixed this too.

The net result of this is that the stacking tests in
per_interrepository/test_fetch.py are starting to get pretty large.  I suspect
we can do better here... (but after this branch lands, I hope!)

> > > > === modified file 'bzrlib/tests/test_remote.py'
> > > > --- bzrlib/tests/test_remote.py 2009-07-30 04:27:05 +0000
> > > > +++ bzrlib/tests/test_remote.py 2009-08-05 01:41:03 +0000
> > > > 
> > > > +    """Tests for using Repository.insert_stream verb when the _1.18 variant is
> > > > +    not available.
> > > > +
> > > > +    This test case is very similar to TestRepositoryInsertStream_1_18.
> > > > +    """
> > > 
> > > Perhaps a base class would make this smaller - it is nearly identical.
> > 
> > I don't see any elegant ways to cut the duplication at this point.  I haven't
> > tried very hard though, because it's just small enough that it fits under my
> > threshold for tolerating the ugly...
> 
> It went past mine :). TemplateMethod perhaps?

The thing is that the important part of the tests is the expected network
conversation, and that's also the bulk of the near-duplication.  So I'm finding
it hard to see a way to reduce that duplication without obscuring the key part
of the test.

I've refactored out a bit of duplication that isn't the network conversation,
and it does help a bit.

> > > I"m not sure how to say 'needs fixing' in review mail, but thats what
> > > I'd say.

(It's “review needs-fixing”, btw.  Not sure when that got added.)

I've attached a psuedo-incremental diff (I merged latest bzr.dev into the tip of
this branch, and separately bzr.dev into the last reviewed revision, and then
produced the diff of those two).  So if anything looks funny in the diff you
might want to check the actual branch.

-Andrew.

psuedo-incremental.diff

Revision history for this message

John A Meinel (jameinel) wrote on 2009-08-13:

Download full text (4.7 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Robert Collins wrote:
>> On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:
> [...]
>> [debug flags]
>>> I hope we don't need to ask people to use them, but they are a cheap insurance
>>> policy.
>>>
>>> More usefully, they are helpful for benchmarking and testing. I'm ok with
>>> hiding them if you like.
>> I don't have a strong opinion. The flags are in our docs though as it
>> stands, so they should be clear to people reading them rather than
>> perhaps causing confusion.
>
> I'll leave them in, assuming I can find a way to make the ReST markup cope with
> them...
>

Well, you could always change them to IDS_always to make it consistent
with our other debug flags.

>>> I don't understand why that case isn't covered. Perhaps you should file the bug
>>> instead of me?
>> I'd like to make sure I explain it well enough first; once you
>> understand either of us can file it - and I'll be happy enough to be
>> that person.
> [...]
>> What you need to do is change the test from len(self.packs) == 1, to
>> len(packs being combined) == 1 and that_pack has the same hash.
>
> But AIUI “self.packs” on a Packer *is* “packs being combined”. If it's not then
> your explanation makes sense, but my reading of the code says otherwise.

So he might have been thinking of PackCollection.packs, but you are
right. Packer.packs is certainly the "packs being combined".

...

> Oh, actually, it does matter. When hint == [], no packs should be repacked.
> When hint is None, all packs should be repacked.

Agreed.

...

> A tangent: when we transfer parent inventories to a stacked repo, should we
> transfer the texts introduced by those inventories or not?

No. The point of transfering those parent inventories is so that we can
clearly determine what texts are in the children that are not in the
parents.

> I'm seeing a
> worrisome half-and-half behaviour from the streaming code, where the synthesised
> rich root, but no other texts, are transferred for inventories requested via
> get_stream_for_missing_keys. I think the intention is not (if those parent
> revisions are ever filled in later, then at that time the corresponding texts
> would be backfilled too). I've now fixed this too.

Right. I'm personally not very fond of what we've had to do with
stacking. We had an abstraction in place that made stacked branches
auto-fallback to their parents, then we break that abstraction, but only
when accessing via Remote, which then means we need to fill in some
data, which starts breaking the other assumptions like having an
inventory means having all the texts, etc.

Anyway, I think you have it correct, that the *new* assertion is
something along the lines of:

If we have a revision, then we have its inventory, and all the new
texts introduced by that revision, relative to all of its parents.

>
> The net result of this is that the stacking tests in
> per_interrepository/test_fetch.py are starting to get pretty large. I suspect
> we can do better here... (but after this branch lands, I hope!)
>

Well, ideally most of that would be in "per_repository_reference" which
is the...

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Robert Collins wrote:
>> On Fri, 2009-08-07 at 05:39 +0000, Andrew Bennetts wrote:
> [...]
>> [debug flags] 
>>> I hope we don't need to ask people to use them, but they are a cheap insurance
>>> policy.
>>>
>>> More usefully, they are helpful for benchmarking and testing.  I'm ok with
>>> hiding them if you like.
>> I don't have a strong opinion. The flags are in our docs though as it
>> stands, so they should be clear to people reading them rather than
>> perhaps causing confusion.
> 
> I'll leave them in, assuming I can find a way to make the ReST markup cope with
> them...
>

Well, you could always change them to IDS_always to make it consistent
with our other debug flags.

>>> I don't understand why that case isn't covered.  Perhaps you should file the bug
>>> instead of me?
>> I'd like to make sure I explain it well enough first; once you
>> understand either of us can file it - and I'll be happy enough to be
>> that person. 
> [...]
>> What you need to do is change the test from len(self.packs) == 1, to
>> len(packs being combined) == 1 and that_pack has the same hash.
> 
> But AIUI “self.packs” on a Packer *is* “packs being combined”.  If it's not then
> your explanation makes sense, but my reading of the code says otherwise.

So he might have been thinking of PackCollection.packs, but you are
right. Packer.packs is certainly the "packs being combined".

...

> Oh, actually, it does matter.  When hint == [], no packs should be repacked.
> When hint is None, all packs should be repacked.

Agreed.

...

> A tangent: when we transfer parent inventories to a stacked repo, should we
> transfer the texts introduced by those inventories or not?

No. The point of transfering those parent inventories is so that we can
clearly determine what texts are in the children that are not in the
parents.

> I'm seeing a
> worrisome half-and-half behaviour from the streaming code, where the synthesised
> rich root, but no other texts, are transferred for inventories requested via
> get_stream_for_missing_keys.  I think the intention is not (if those parent
> revisions are ever filled in later, then at that time the corresponding texts
> would be backfilled too).  I've now fixed this too.

Anyway, I think you have it correct, that the *new* assertion is
something along the lines of:

If we have a revision, then we have its inventory, and all the new
    texts introduced by that revision, relative to all of its parents.

> 
> The net result of this is that the stacking tests in
> per_interrepository/test_fetch.py are starting to get pretty large.  I suspect
> we can do better here... (but after this branch lands, I hope!)
>

Well, ideally most of that would be in "per_repository_reference" which
is the formats that can be stacked.

I realize there is a bit of a push, because then you also have
per_interrepo_reference :).

...

> (It's “review needs-fixing”, btw.  Not sure when that got added.)
>

or needs_fixing or needsfixing
or possibly +0

> I've attached a psuedo-incremental diff (I merged latest bzr.dev into the tip of
> this branch, and separately bzr.dev into the last reviewed revision, and then
> produced the diff of those two).  So if anything looks funny in the diff you
> might want to check the actual branch.
> 
> -Andrew.

def _iter_inventories(self, revision_ids, ordering):
         """Iterate over many inventory objects."""
+        if ordering is None:
+            ordering = 'unordered'

^- I thought we determined that if it wasn't passed it needed to be
"as_requested" ?

I would probably change it so that you have a:
stream_ordering = XXX
versus
logical_ordering = YYY

Given the ugliness underneath introduced by the flag, I almost wonder if
it wouldn't be clearer to have 2 separate helper functions:

def _iter_inventories_as_requested()

def _iter_inventories_unordered()

I realize there would be *some* duplication, but it would probably be
clearer to figure out what was going on.

Anyway, think about it, I wouldn't block the patch on it.

The rest seems fine to me.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqEToQACgkQJdeBCYSNAAM05gCgmqmahotTEg4OVxTkPsEgA/IT
GSYAoKqOxYVsR+dxwSDBoSKTGnAxpnmJ
=kUs6
-----END PGP SIGNATURE-----

Revision history for this message

Robert Collins (lifeless) wrote on 2009-08-13:

On Thu, 2009-08-13 at 17:36 +0000, John A Meinel wrote:
>
> Anyway, I think you have it correct, that the *new* assertion is
> something along the lines of:
>
> If we have a revision, then we have its inventory, and all the new
> texts introduced by that revision, relative to all of its parents.

I think thats insufficient because of the need to be able to make a
delta. See my reply to Andrew.

-Rob

Revision history for this message

Robert Collins (lifeless) wrote on 2009-08-13:

On Thu, 2009-08-13 at 13:03 +0000, Andrew Bennetts wrote:

{Only replying to things that need actions}

Re: Packer.packs - Garh context and failing memory. Sounds like you're
correct. John agrees that I was misremembering - so cool, your code has
it correct.

> A tangent: when we transfer parent inventories to a stacked repo,
> should we
> transfer the texts introduced by those inventories or not? I'm seeing
> a
> worrisome half-and-half behaviour from the streaming code, where the
> synthesised
> rich root, but no other texts, are transferred for inventories
> requested via
> get_stream_for_missing_keys. I think the intention is not (if those
> parent
> revisions are ever filled in later, then at that time the
> corresponding texts
> would be backfilled too). I've now fixed this too.

There's no requirement that parent inventory entry text references be
filled; the invariant is;
"A repository can always output a delta for any revision object it has
against all the revisions parents; a delta consists of:
-revision
-signature
-inventory delta against an arbitrary parent
-all text versions not present in any parent
"

We should write this down somewhere.

Woo. _doit_

-Rob

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk

Subscribers

People subscribed via source and target branches

to all changes:

Aaron Bentley

Andrew Bennetts

Denys Duchier

Eric Siegerman

Gary van der Merwe

Jelmer Vernooij

John Szakmeister

Jonathan Lange

Marius Kruger

Martin Albisetti

Matt Nordhoff

Paul Hummer

SuperMMX

Talden

Yoshinori Sano

to status/vote changes:

Alexander Belchenko

Martin Eisenhardt

Tim Penhey

Vincent Ladeuil

Bazaar

Merge lp:~spiv/bzr/inventory-delta into lp:~bzr/bzr/trunk-old

Commit message

Description of the change

Preview Diff

Subscribers

 === modified file 'NEWS'
 --- NEWS	2009-08-13 12:51:59 +0000
 +++ NEWS	2009-08-13 03:29:52 +0000
@@ -158,6 +158,10 @@
    lots of backtraces about ``UnknownSmartMethod``, ``do_chunk`` or
    ``do_end``.  (Andrew Bennetts, #338561)
++* ``RemoteStreamSource.get_stream_for_missing_keys`` will fetch CHK
++  inventory pages when appropriate (by falling back to the vfs stream
++  source).  (Andrew Bennetts, #406686)
++
  * Streaming from bzr servers where there is a chain of stacked branches
    (A stacked on B stacked on C) will now work. (Robert Collins, #406597)
@@ -249,8 +253,10 @@
  * ``CHKMap.apply_delta`` now raises ``InconsistentDelta`` if a delta adds
    as new a key which was already mapped. (Robert Collins)
--* InterDifferingSerializer has been removed.  The transformations it
--  provided are now done automatically by StreamSource.  (Andrew Bennetts)
++* InterDifferingSerializer is now only used locally.  Other fetches that
++  would have used InterDifferingSerializer now use the more network
++  friendly StreamSource, which now automatically does the same
++  transformations as InterDifferingSerializer.  (Andrew Bennetts)
  * Inventory delta application catches more cases of corruption and can
    prevent corrupt deltas from affecting consistency of data structures on
 === modified file 'bzrlib/fetch.py'
 --- bzrlib/fetch.py	2009-07-29 07:08:54 +0000
 +++ bzrlib/fetch.py	2009-08-07 04:27:02 +0000
@@ -260,6 +260,19 @@
  def _new_root_data_stream(
      root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):
++    """Generate a texts substream of synthesised root entries.
++
++    Used in fetches that do rich-root upgrades.
++
++    :param root_keys_to_create: iterable of (root_id, rev_id) pairs describing
++        the root entries to create.
++    :param rev_id_to_root_id_map: dict of known rev_id -> root_id mappings for
++        calculating the parents.  If a parent rev_id is not found here then it
++        will be recalculated.
++    :param parent_map: a parent map for all the revisions in
++        root_keys_to_create.
++    :param graph: a graph to use instead of repo.get_graph().
++    """
      for root_key in root_keys_to_create:
          root_id, rev_id = root_key
          parent_keys = _parent_keys_for_root_version(
@@ -270,7 +283,10 @@
  def _parent_keys_for_root_version(
      root_id, rev_id, rev_id_to_root_id_map, parent_map, repo, graph=None):
--    """Get the parent keys for a given root id."""
++    """Get the parent keys for a given root id.
++
++    A helper function for _new_root_data_stream.
++    """
      # Include direct parents of the revision, but only if they used the same
      # root_id and are heads.
      rev_parents = parent_map[rev_id]
 === modified file 'bzrlib/inventory_delta.py'
 --- bzrlib/inventory_delta.py	2009-08-05 02:30:11 +0000
 +++ bzrlib/inventory_delta.py	2009-08-11 08:40:32 +0000
@@ -29,6 +29,25 @@
  from bzrlib import inventory
  from bzrlib.revision import NULL_REVISION
++FORMAT_1 = 'bzr inventory delta v1 (bzr 1.14)'
++
++
++class InventoryDeltaError(errors.BzrError):
++    """An error when serializing or deserializing an inventory delta."""
++
++    # Most errors when serializing and deserializing are due to bugs, although
++    # damaged input (i.e. a bug in a different process) could cause
++    # deserialization errors too.
++    internal_error = True
++
++
++class IncompatibleInventoryDelta(errors.BzrError):
++    """The delta could not be deserialised because its contents conflict with
++    the allow_versioned_root or allow_tree_references flags of the
++    deserializer.
++    """
++    internal_error = False
++
  def _directory_content(entry):
      """Serialize the content component of entry which is a directory.
@@ -49,7 +68,7 @@
          exec_bytes = ''
      size_exec_sha = (entry.text_size, exec_bytes, entry.text_sha1)
      if None in size_exec_sha:
--        raise errors.BzrError('Missing size or sha for %s' % entry.file_id)
++        raise InventoryDeltaError('Missing size or sha for %s' % entry.file_id)
      return "file\x00%d\x00%s\x00%s" % size_exec_sha
@@ -60,7 +79,7 @@
      """
      target = entry.symlink_target
      if target is None:
--        raise errors.BzrError('Missing target for %s' % entry.file_id)
++        raise InventoryDeltaError('Missing target for %s' % entry.file_id)
      return "link\x00%s" % target.encode('utf8')
@@ -71,7 +90,8 @@
      """
      tree_revision = entry.reference_revision
      if tree_revision is None:
--        raise errors.BzrError('Missing reference revision for %s' % entry.file_id)
++        raise InventoryDeltaError(
++            'Missing reference revision for %s' % entry.file_id)
      return "tree\x00%s" % tree_revision
@@ -116,39 +136,22 @@
  class InventoryDeltaSerializer(object):
--    """Serialize and deserialize inventory deltas."""
--
--    # XXX: really, the serializer and deserializer should be two separate
--    # classes.
--
--    FORMAT_1 = 'bzr inventory delta v1 (bzr 1.14)'
--
--    def __init__(self):
--        """Create an InventoryDeltaSerializer."""
--        self._versioned_root = None
--        self._tree_references = None
++    """Serialize inventory deltas."""
++
++    def __init__(self, versioned_root, tree_references):
++        """Create an InventoryDeltaSerializer.
++
++        :param versioned_root: If True, any root entry that is seen is expected
++            to be versioned, and root entries can have any fileid.
++        :param tree_references: If True support tree-reference entries.
++        """
++        self._versioned_root = versioned_root
++        self._tree_references = tree_references
          self._entry_to_content = {
              'directory': _directory_content,
              'file': _file_content,
              'symlink': _link_content,
+         }
--
--    def require_flags(self, versioned_root=None, tree_references=None):
--        """Set the versioned_root and/or tree_references flags for this
--        (de)serializer.
--
--        :param versioned_root: If True, any root entry that is seen is expected
--            to be versioned, and root entries can have any fileid.
--        :param tree_references: If True support tree-reference entries.
--        """
--        if versioned_root is not None and self._versioned_root is not None:
--            raise AssertionError(
--                "require_flags(versioned_root=...) already called.")
--        if tree_references is not None and self._tree_references is not None:
--            raise AssertionError(
--                "require_flags(tree_references=...) already called.")
--        self._versioned_root = versioned_root
--        self._tree_references = tree_references
          if tree_references:
              self._entry_to_content['tree-reference'] = _reference_content
@@ -167,10 +170,6 @@
              takes.
          :return: The serialized delta as lines.
          """
--        if self._versioned_root is None or self._tree_references is None:
--            raise AssertionError(
--                "Cannot serialise unless versioned_root/tree_references flags "
--                "are both set.")
          if type(old_name) is not str:
              raise TypeError('old_name should be str, got %r' % (old_name,))
          if type(new_name) is not str:
@@ -180,11 +179,11 @@
          for delta_item in delta_to_new:
              line = to_line(delta_item, new_name)
              if line.__class__ != str:
--                raise errors.BzrError(
++                raise InventoryDeltaError(
                      'to_line generated non-str output %r' % lines[-1])
              lines.append(line)
          lines.sort()
--        lines[0] = "format: %s\n" % InventoryDeltaSerializer.FORMAT_1
++        lines[0] = "format: %s\n" % FORMAT_1
          lines[1] = "parent: %s\n" % old_name
          lines[2] = "version: %s\n" % new_name
          lines[3] = "versioned_root: %s\n" % self._serialize_bool(
@@ -234,23 +233,37 @@
                  # file-ids other than TREE_ROOT, e.g. repo formats that use the
                  # xml5 serializer.
                  if last_modified != new_version:
--                    raise errors.BzrError(
++                    raise InventoryDeltaError(
                          'Version present for / in %s (%s != %s)'
                          % (file_id, last_modified, new_version))
              if last_modified is None:
--                raise errors.BzrError("no version for fileid %s" % file_id)
++                raise InventoryDeltaError("no version for fileid %s" % file_id)
              content = self._entry_to_content[entry.kind](entry)
          return ("%s\x00%s\x00%s\x00%s\x00%s\x00%s\n" %
              (oldpath_utf8, newpath_utf8, file_id, parent_id, last_modified,
                  content))
++
++class InventoryDeltaDeserializer(object):
++    """Deserialize inventory deltas."""
++
++    def __init__(self, allow_versioned_root=True, allow_tree_references=True):
++        """Create an InventoryDeltaDeserializer.
++
++        :param versioned_root: If True, any root entry that is seen is expected
++            to be versioned, and root entries can have any fileid.
++        :param tree_references: If True support tree-reference entries.
++        """
++        self._allow_versioned_root = allow_versioned_root
++        self._allow_tree_references = allow_tree_references
++
      def _deserialize_bool(self, value):
          if value == "true":
              return True
          elif value == "false":
              return False
          else:
--            raise errors.BzrError("value %r is not a bool" % (value,))
++            raise InventoryDeltaError("value %r is not a bool" % (value,))
      def parse_text_bytes(self, bytes):
          """Parse the text bytes of a serialized inventory delta.
@@ -266,32 +279,24 @@
          """
          if bytes[-1:] != '\n':
              last_line = bytes.rsplit('\n', 1)[-1]
--            raise errors.BzrError('last line not empty: %r' % (last_line,))
++            raise InventoryDeltaError('last line not empty: %r' % (last_line,))
          lines = bytes.split('\n')[:-1] # discard the last empty line
--        if not lines or lines[0] != 'format: %s' % InventoryDeltaSerializer.FORMAT_1:
--            raise errors.BzrError('unknown format %r' % lines[0:1])
++        if not lines or lines[0] != 'format: %s' % FORMAT_1:
++            raise InventoryDeltaError('unknown format %r' % lines[0:1])
          if len(lines) < 2 or not lines[1].startswith('parent: '):
--            raise errors.BzrError('missing parent: marker')
++            raise InventoryDeltaError('missing parent: marker')
          delta_parent_id = lines[1][8:]
          if len(lines) < 3 or not lines[2].startswith('version: '):
--            raise errors.BzrError('missing version: marker')
++            raise InventoryDeltaError('missing version: marker')
          delta_version_id = lines[2][9:]
          if len(lines) < 4 or not lines[3].startswith('versioned_root: '):
--            raise errors.BzrError('missing versioned_root: marker')
++            raise InventoryDeltaError('missing versioned_root: marker')
          delta_versioned_root = self._deserialize_bool(lines[3][16:])
          if len(lines) < 5 or not lines[4].startswith('tree_references: '):
--            raise errors.BzrError('missing tree_references: marker')
++            raise InventoryDeltaError('missing tree_references: marker')
          delta_tree_references = self._deserialize_bool(lines[4][17:])
--        if (self._versioned_root is not None and
--            delta_versioned_root != self._versioned_root):
--            raise errors.BzrError(
--                "serialized versioned_root flag is wrong: %s" %
--                (delta_versioned_root,))
--        if (self._tree_references is not None
--            and delta_tree_references != self._tree_references):
--            raise errors.BzrError(
--                "serialized tree_references flag is wrong: %s" %
--                (delta_tree_references,))
++        if (not self._allow_versioned_root and delta_versioned_root):
++            raise IncompatibleInventoryDelta("versioned_root not allowed")
          result = []
          seen_ids = set()
          line_iter = iter(lines)
@@ -302,24 +307,30 @@
                  content) = line.split('\x00', 5)
              parent_id = parent_id or None
              if file_id in seen_ids:
--                raise errors.BzrError(
++                raise InventoryDeltaError(
                      "duplicate file id in inventory delta %r" % lines)
              seen_ids.add(file_id)
              if (newpath_utf8 == '/' and not delta_versioned_root and
                  last_modified != delta_version_id):
--                    # Delta claims to be not rich root, yet here's a root entry
--                    # with either non-default version, i.e. it's rich...
--                    raise errors.BzrError("Versioned root found: %r" % line)
++                    # Delta claims to be not have a versioned root, yet here's
++                    # a root entry with a non-default version.
++                    raise InventoryDeltaError("Versioned root found: %r" % line)
              elif newpath_utf8 != 'None' and last_modified[-1] == ':':
                  # Deletes have a last_modified of null:, but otherwise special
                  # revision ids should not occur.
--                raise errors.BzrError('special revisionid found: %r' % line)
--            if delta_tree_references is False and content.startswith('tree\x00'):
--                raise errors.BzrError("Tree reference found: %r" % line)
++                raise InventoryDeltaError('special revisionid found: %r' % line)
++            if content.startswith('tree\x00'):
++                if delta_tree_references is False:
++                    raise InventoryDeltaError(
++                            "Tree reference found (but header said "
++                            "tree_references: false): %r" % line)
++                elif not self._allow_tree_references:
++                    raise IncompatibleInventoryDelta(
++                        "Tree reference not allowed")
              if oldpath_utf8 == 'None':
                  oldpath = None
              elif oldpath_utf8[:1] != '/':
--                raise errors.BzrError(
++                raise InventoryDeltaError(
                      "oldpath invalid (does not start with /): %r"
                      % (oldpath_utf8,))
              else:
@@ -328,7 +339,7 @@
              if newpath_utf8 == 'None':
                  newpath = None
              elif newpath_utf8[:1] != '/':
--                raise errors.BzrError(
++                raise InventoryDeltaError(
                      "newpath invalid (does not start with /): %r"
                      % (newpath_utf8,))
              else:
@@ -340,15 +351,14 @@
                  entry = None
              else:
                  entry = _parse_entry(
--                    newpath_utf8, file_id, parent_id, last_modified,
--                    content_tuple)
++                    newpath, file_id, parent_id, last_modified, content_tuple)
              delta_item = (oldpath, newpath, file_id, entry)
              result.append(delta_item)
          return (delta_parent_id, delta_version_id, delta_versioned_root,
                  delta_tree_references, result)
--def _parse_entry(utf8_path, file_id, parent_id, last_modified, content):
++def _parse_entry(path, file_id, parent_id, last_modified, content):
      entry_factory = {
          'dir': _dir_to_entry,
          'file': _file_to_entry,
@@ -356,10 +366,10 @@
          'tree': _tree_to_entry,
+     }
      kind = content[0]
--    if utf8_path.startswith('/'):
++    if path.startswith('/'):
          raise AssertionError
--    path = utf8_path.decode('utf8')
      name = basename(path)
      return entry_factory[content[0]](
              content, name, parent_id, file_id, last_modified)
++
 === modified file 'bzrlib/remote.py'
 --- bzrlib/remote.py	2009-08-13 12:51:59 +0000
 +++ bzrlib/remote.py	2009-08-13 08:21:02 +0000
@@ -587,11 +587,6 @@
          self._ensure_real()
          return self._custom_format._serializer
--    @property
--    def repository_class(self):
--        self._ensure_real()
--        return self._custom_format.repository_class
--
  class RemoteRepository(_RpcHelper):
      """Repository accessed over rpc.
@@ -1684,9 +1679,6 @@
  class RemoteStreamSink(repository.StreamSink):
--    def __init__(self, target_repo):
--        repository.StreamSink.__init__(self, target_repo)
--
      def _insert_real(self, stream, src_format, resume_tokens):
          self.target_repo._ensure_real()
          sink = self.target_repo._real_repository._get_sink()
@@ -1708,6 +1700,10 @@
          client = target._client
          medium = client._medium
          path = target.bzrdir._path_for_remote_call(client)
++        # Probe for the verb to use with an empty stream before sending the
++        # real stream to it.  We do this both to avoid the risk of sending a
++        # large request that is then rejected, and because we don't want to
++        # implement a way to buffer, rewind, or restart the stream.
          found_verb = False
          for verb, required_version in candidate_calls:
              if medium._is_remote_before(required_version):
 === modified file 'bzrlib/repofmt/groupcompress_repo.py'
 --- bzrlib/repofmt/groupcompress_repo.py	2009-08-13 12:51:59 +0000
 +++ bzrlib/repofmt/groupcompress_repo.py	2009-08-13 08:20:28 +0000
@@ -484,6 +484,8 @@
              old_pack = self.packs[0]
              if old_pack.name == self.new_pack._hash.hexdigest():
                  # The single old pack was already optimally packed.
++                trace.mutter('single pack %s was already optimally packed',
++                    old_pack.name)
                  self.new_pack.abort()
                  return None
          self.pb.update('finishing repack', 6, 7)
@@ -600,7 +602,7 @@
              packer = GCCHKPacker(self, packs, '.autopack',
                                   reload_func=reload_func)
              try:
--                packer.pack()
++                result = packer.pack()
              except errors.RetryWithNewPacks:
                  # An exception is propagating out of this context, make sure
                  # this packer has cleaned up. Packer() doesn't set its new_pack
@@ -609,6 +611,8 @@
                  if packer.new_pack is not None:
                      packer.new_pack.abort()
                  raise
++            if result is None:
++                return
              for pack in packs:
                  self._remove_pack_from_memory(pack)
          # record the newly available packs and stop advertising the old
@@ -792,6 +796,8 @@
      def _iter_inventories(self, revision_ids, ordering):
          """Iterate over many inventory objects."""
++        if ordering is None:
++            ordering = 'unordered'
          keys = [(revision_id,) for revision_id in revision_ids]
          stream = self.inventories.get_record_stream(keys, ordering, True)
          texts = {}
@@ -903,9 +909,7 @@
      def _get_source(self, to_format):
          """Return a source for streaming from this repository."""
--        if (to_format.supports_chks and
--            self._format.repository_class is to_format.repository_class and
--            self._format._serializer == to_format._serializer):
++        if self._format._serializer == to_format._serializer:
              # We must be exactly the same format, otherwise stuff like the chk
              # page layout might be different.
              # Actually, this test is just slightly looser than exact so that
 === modified file 'bzrlib/repofmt/pack_repo.py'
 --- bzrlib/repofmt/pack_repo.py	2009-08-13 12:51:59 +0000
 +++ bzrlib/repofmt/pack_repo.py	2009-08-13 07:26:29 +0000
@@ -1575,6 +1575,8 @@
          pack_operations = [[0, []]]
          for pack in self.all_packs():
              if hint is None or pack.name in hint:
++                # Either no hint was provided (so we are packing everything),
++                # or this pack was included in the hint.
                  pack_operations[-1][0] += pack.get_revision_count()
                  pack_operations[-1][1].append(pack)
          self._execute_pack_operations(pack_operations, OptimisingPacker)
 === modified file 'bzrlib/repository.py'
 --- bzrlib/repository.py	2009-08-13 12:51:59 +0000
 +++ bzrlib/repository.py	2009-08-13 08:56:51 +0000
@@ -1537,6 +1537,8 @@
          """Commit the contents accrued within the current write group.
          :seealso: start_write_group.
++
++        :return: it may return an opaque hint that can be passed to 'pack'.
          """
          if self._write_group is not self.get_transaction():
              # has an unlock or relock occured ?
@@ -2348,7 +2350,7 @@
          """Get Inventory object by revision id."""
          return self.iter_inventories([revision_id]).next()
--    def iter_inventories(self, revision_ids, ordering='unordered'):
++    def iter_inventories(self, revision_ids, ordering=None):
          """Get many inventories by revision_ids.
          This will buffer some or all of the texts used in constructing the
@@ -2356,7 +2358,9 @@
          time.
          :param revision_ids: The expected revision ids of the inventories.
--        :param ordering: optional ordering, e.g. 'topological'.
++        :param ordering: optional ordering, e.g. 'topological'.  If not
++            specified, the order of revision_ids will be preserved (by
++            buffering if necessary).
          :return: An iterator of inventories.
          """
          if ((None in revision_ids)
@@ -2370,29 +2374,41 @@
          for text, revision_id in inv_xmls:
              yield self.deserialise_inventory(revision_id, text)
--    def _iter_inventory_xmls(self, revision_ids, ordering='unordered'):
++    def _iter_inventory_xmls(self, revision_ids, ordering):
++        if ordering is None:
++            order_as_requested = True
++            ordering = 'unordered'
++        else:
++            order_as_requested = False
          keys = [(revision_id,) for revision_id in revision_ids]
          if not keys:
              return
--        key_iter = iter(keys)
--        next_key = key_iter.next()
++        if order_as_requested:
++            key_iter = iter(keys)
++            next_key = key_iter.next()
          stream = self.inventories.get_record_stream(keys, ordering, True)
          text_chunks = {}
          for record in stream:
              if record.storage_kind != 'absent':
--                text_chunks[record.key] = record.get_bytes_as('chunked')
++                chunks = record.get_bytes_as('chunked')
++                if order_as_requested:
++                    text_chunks[record.key] = chunks
++                else:
++                    yield ''.join(chunks), record.key[-1]
              else:
                  raise errors.NoSuchRevision(self, record.key)
--            while next_key in text_chunks:
--                chunks = text_chunks.pop(next_key)
--                yield ''.join(chunks), next_key[-1]
--                try:
--                    next_key = key_iter.next()
--                except StopIteration:
--                    # We still want to fully consume the get_record_stream,
--                    # just in case it is not actually finished at this point
--                    next_key = None
--                    break
++            if order_as_requested:
++                # Yield as many results as we can while preserving order.
++                while next_key in text_chunks:
++                    chunks = text_chunks.pop(next_key)
++                    yield ''.join(chunks), next_key[-1]
++                    try:
++                        next_key = key_iter.next()
++                    except StopIteration:
++                        # We still want to fully consume the get_record_stream,
++                        # just in case it is not actually finished at this point
++                        next_key = None
++                        break
      def deserialise_inventory(self, revision_id, xml):
          """Transform the xml into an inventory object.
@@ -4224,20 +4240,14 @@
          for record in substream:
              # Insert the delta directly
              inventory_delta_bytes = record.get_bytes_as('fulltext')
--            deserialiser = inventory_delta.InventoryDeltaSerializer()
--            parse_result = deserialiser.parse_text_bytes(inventory_delta_bytes)
++            deserialiser = inventory_delta.InventoryDeltaDeserializer()
++            try:
++                parse_result = deserialiser.parse_text_bytes(
++                    inventory_delta_bytes)
++            except inventory_delta.IncompatibleInventoryDelta, err:
++                trace.mutter("Incompatible delta: %s", err.msg)
++                raise errors.IncompatibleRevision(self.target_repo._format)
              basis_id, new_id, rich_root, tree_refs, inv_delta = parse_result
--            # Make sure the delta is compatible with the target
--            if rich_root and not target_rich_root:
--                raise errors.IncompatibleRevision(self.target_repo._format)
--            if tree_refs and not target_tree_refs:
--                # The source supports tree refs and the target doesn't.  Check
--                # the delta for tree refs; if it has any we can't insert it.
--                for delta_item in inv_delta:
--                    entry = delta_item[3]
--                    if entry.kind == 'tree-reference':
--                        raise errors.IncompatibleRevision(
--                            self.target_repo._format)
              revision_id = new_id
              parents = [key[0] for key in record.parents]
              self.target_repo.add_inventory_by_delta(
@@ -4404,10 +4414,6 @@
                  # Some missing keys are genuinely ghosts, filter those out.
                  present = self.from_repository.inventories.get_parent_map(keys)
                  revs = [key[0] for key in present]
--                # As with the original stream, we may need to generate root
--                # texts for the inventories we're about to stream.
--                for _ in self._generate_root_texts(revs):
--                    yield _
                  # Get the inventory stream more-or-less as we do for the
                  # original stream; there's no reason to assume that records
                  # direct from the source will be suitable for the sink.  (Think
@@ -4474,7 +4480,7 @@
      def _get_convertable_inventory_stream(self, revision_ids,
                                            delta_versus_null=False):
--        # The source is using CHKs, but the target either doesn't or is has a
++        # The source is using CHKs, but the target either doesn't or it has a
          # different serializer.  The StreamSink code expects to be able to
          # convert on the target, so we need to put bytes-on-the-wire that can
          # be converted.  That means inventory deltas (if the remote is <1.18,
@@ -4499,17 +4505,17 @@
          # method...
          inventories = self.from_repository.iter_inventories(
              revision_ids, 'topological')
--        # XXX: ideally these flags would be per-revision, not per-repo (e.g.
--        # streaming a non-rich-root revision out of a rich-root repo back into
--        # a non-rich-root repo ought to be allowed)
          format = from_repo._format
--        flags = (format.rich_root_data, format.supports_tree_reference)
          invs_sent_so_far = set([_mod_revision.NULL_REVISION])
          inventory_cache = lru_cache.LRUCache(50)
          null_inventory = from_repo.revision_tree(
              _mod_revision.NULL_REVISION).inventory
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(*flags)
++        # XXX: ideally the rich-root/tree-refs flags would be per-revision, not
++        # per-repo (e.g.  streaming a non-rich-root revision out of a rich-root
++        # repo back into a non-rich-root repo ought to be allowed)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=format.rich_root_data,
++            tree_references=format.supports_tree_reference)
          for inv in inventories:
              key = (inv.revision_id,)
              parent_keys = parent_map.get(key, ())
 === modified file 'bzrlib/smart/repository.py'
 --- bzrlib/smart/repository.py	2009-08-04 00:51:24 +0000
 +++ bzrlib/smart/repository.py	2009-08-13 08:20:53 +0000
@@ -424,18 +424,21 @@
          return None # Signal that we want a body.
      def _should_fake_unknown(self):
--        # This is a workaround for bugs in pre-1.18 clients that claim to
--        # support receiving streams of CHK repositories.  The pre-1.18 client
--        # expects inventory records to be serialized in the format defined by
--        # to_network_name, but in pre-1.18 (at least) that format definition
--        # tries to use the xml5 serializer, which does not correctly handle
--        # rich-roots.  After 1.18 the client can also accept inventory-deltas
--        # (which avoids this issue), and those clients will use the
--        # Repository.get_stream_1.18 verb instead of this one.
--        # So: if this repository is CHK, and the to_format doesn't match,
--        # we should just fake an UnknownSmartMethod error so that the client
--        # will fallback to VFS, rather than sending it a stream we know it
--        # cannot handle.
++        """Return True if we should return UnknownMethod to the client.
++
++        This is a workaround for bugs in pre-1.18 clients that claim to
++        support receiving streams of CHK repositories.  The pre-1.18 client
++        expects inventory records to be serialized in the format defined by
++        to_network_name, but in pre-1.18 (at least) that format definition
++        tries to use the xml5 serializer, which does not correctly handle
++        rich-roots.  After 1.18 the client can also accept inventory-deltas
++        (which avoids this issue), and those clients will use the
++        Repository.get_stream_1.18 verb instead of this one.
++        So: if this repository is CHK, and the to_format doesn't match,
++        we should just fake an UnknownSmartMethod error so that the client
++        will fallback to VFS, rather than sending it a stream we know it
++        cannot handle.
++        """
          from_format = self._repository._format
          to_format = self._to_format
          if not from_format.supports_chks:
@@ -489,8 +492,7 @@
  class SmartServerRepositoryGetStream_1_18(SmartServerRepositoryGetStream):
      def _should_fake_unknown(self):
--        # The client is at least 1.18, so we don't need to work around any
--        # bugs.
++        """Returns False; we don't need to workaround bugs in 1.18+ clients."""
          return False
 === modified file 'bzrlib/tests/per_interrepository/__init__.py'
 --- bzrlib/tests/per_interrepository/__init__.py	2009-08-13 12:51:59 +0000
 +++ bzrlib/tests/per_interrepository/__init__.py	2009-08-13 03:30:41 +0000
@@ -46,12 +46,12 @@
      """Transform the input formats to a list of scenarios.
      :param formats: A list of tuples:
--        (interrepo_class, repository_format, repository_format_to).
++        (label, repository_format, repository_format_to).
      """
      result = []
--    for repository_format, repository_format_to in formats:
--        id = '%s,%s' % (repository_format.__class__.__name__,
--                        repository_format_to.__class__.__name__)
++    for label, repository_format, repository_format_to in formats:
++        id = '%s,%s,%s' % (label, repository_format.__class__.__name__,
++                           repository_format_to.__class__.__name__)
          scenario = (id,
              {"transport_server": transport_server,
               "transport_readonly_server": transport_readonly_server,
@@ -71,8 +71,8 @@
          weaverepo,
+         )
      result = []
--    def add_combo(from_format, to_format):
--        result.append((from_format, to_format))
++    def add_combo(label, from_format, to_format):
++        result.append((label, from_format, to_format))
      # test the default InterRepository between format 6 and the current
      # default format.
      # XXX: robertc 20060220 reinstate this when there are two supported
@@ -83,32 +83,47 @@
      for optimiser_class in InterRepository._optimisers:
          format_to_test = optimiser_class._get_repo_format_to_test()
          if format_to_test is not None:
--            add_combo(format_to_test, format_to_test)
++            add_combo(optimiser_class.__name__, format_to_test, format_to_test)
      # if there are specific combinations we want to use, we can add them
      # here. We want to test rich root upgrading.
--    add_combo(weaverepo.RepositoryFormat5(),
--              knitrepo.RepositoryFormatKnit3())
--    add_combo(knitrepo.RepositoryFormatKnit1(),
--              knitrepo.RepositoryFormatKnit3())
--    add_combo(knitrepo.RepositoryFormatKnit1(),
++    # XXX: although we attach InterRepository class names to these scenarios,
++    # there's nothing asserting that these labels correspond to what it
++    # actually used.
++    add_combo('InterRepository',
++              weaverepo.RepositoryFormat5(),
++              knitrepo.RepositoryFormatKnit3())
++    add_combo('InterRepository',
++              knitrepo.RepositoryFormatKnit1(),
++              knitrepo.RepositoryFormatKnit3())
++    add_combo('InterKnitRepo',
++              knitrepo.RepositoryFormatKnit1(),
                pack_repo.RepositoryFormatKnitPack1())
--    add_combo(pack_repo.RepositoryFormatKnitPack1(),
++    add_combo('InterKnitRepo',
++              pack_repo.RepositoryFormatKnitPack1(),
                knitrepo.RepositoryFormatKnit1())
--    add_combo(knitrepo.RepositoryFormatKnit3(),
++    add_combo('InterKnitRepo',
++              knitrepo.RepositoryFormatKnit3(),
                pack_repo.RepositoryFormatKnitPack3())
--    add_combo(pack_repo.RepositoryFormatKnitPack3(),
++    add_combo('InterKnitRepo',
++              pack_repo.RepositoryFormatKnitPack3(),
                knitrepo.RepositoryFormatKnit3())
--    add_combo(pack_repo.RepositoryFormatKnitPack3(),
++    add_combo('InterKnitRepo',
++              pack_repo.RepositoryFormatKnitPack3(),
                pack_repo.RepositoryFormatKnitPack4())
--    add_combo(pack_repo.RepositoryFormatKnitPack1(),
--              pack_repo.RepositoryFormatKnitPack6RichRoot())
--    add_combo(pack_repo.RepositoryFormatKnitPack6RichRoot(),
--              groupcompress_repo.RepositoryFormat2a())
--    add_combo(groupcompress_repo.RepositoryFormat2a(),
--              pack_repo.RepositoryFormatKnitPack6RichRoot())
--    add_combo(groupcompress_repo.RepositoryFormatCHK2(),
--              groupcompress_repo.RepositoryFormat2a())
--    add_combo(groupcompress_repo.RepositoryFormatCHK1(),
++    add_combo('InterDifferingSerializer',
++              pack_repo.RepositoryFormatKnitPack1(),
++              pack_repo.RepositoryFormatKnitPack6RichRoot())
++    add_combo('InterDifferingSerializer',
++              pack_repo.RepositoryFormatKnitPack6RichRoot(),
++              groupcompress_repo.RepositoryFormat2a())
++    add_combo('InterDifferingSerializer',
++              groupcompress_repo.RepositoryFormat2a(),
++              pack_repo.RepositoryFormatKnitPack6RichRoot())
++    add_combo('InterRepository',
++              groupcompress_repo.RepositoryFormatCHK2(),
++              groupcompress_repo.RepositoryFormat2a())
++    add_combo('InterDifferingSerializer',
++              groupcompress_repo.RepositoryFormatCHK1(),
                groupcompress_repo.RepositoryFormat2a())
      return result
 === modified file 'bzrlib/tests/per_interrepository/test_fetch.py'
 --- bzrlib/tests/per_interrepository/test_fetch.py	2009-08-13 12:51:59 +0000
 +++ bzrlib/tests/per_interrepository/test_fetch.py	2009-08-13 08:55:59 +0000
@@ -191,8 +191,19 @@
              ['left', 'right'])
          self.assertEqual(left_tree.inventory, stacked_left_tree.inventory)
          self.assertEqual(right_tree.inventory, stacked_right_tree.inventory)
--
++
++        # Finally, it's not enough to see that the basis inventories are
++        # present.  The texts introduced in merge (and only those) should be
++        # present, and also generating a stream should succeed without blowing
++        # up.
          self.assertTrue(unstacked_repo.has_revision('merge'))
++        expected_texts = set([('file-id', 'merge')])
++        if stacked_branch.repository.texts.get_parent_map([('root-id',
++            'merge')]):
++            # If a (root-id,merge) text exists, it should be in the stacked
++            # repo.
++            expected_texts.add(('root-id', 'merge'))
++        self.assertEqual(expected_texts, unstacked_repo.texts.keys())
          self.assertCanStreamRevision(unstacked_repo, 'merge')
      def assertCanStreamRevision(self, repo, revision_id):
@@ -241,6 +252,19 @@
          self.addCleanup(stacked_branch.unlock)
          stacked_second_tree = stacked_branch.repository.revision_tree('second')
          self.assertEqual(second_tree.inventory, stacked_second_tree.inventory)
++        # Finally, it's not enough to see that the basis inventories are
++        # present.  The texts introduced in merge (and only those) should be
++        # present, and also generating a stream should succeed without blowing
++        # up.
++        self.assertTrue(unstacked_repo.has_revision('third'))
++        expected_texts = set([('file-id', 'third')])
++        if stacked_branch.repository.texts.get_parent_map([('root-id',
++            'third')]):
++            # If a (root-id,third) text exists, it should be in the stacked
++            # repo.
++            expected_texts.add(('root-id', 'third'))
++        self.assertEqual(expected_texts, unstacked_repo.texts.keys())
++        self.assertCanStreamRevision(unstacked_repo, 'third')
      def test_fetch_missing_basis_text(self):
          """If fetching a delta, we should die if a basis is not present."""
 === modified file 'bzrlib/tests/per_pack_repository.py'
 --- bzrlib/tests/per_pack_repository.py	2009-08-12 22:28:28 +0000
 +++ bzrlib/tests/per_pack_repository.py	2009-08-13 03:29:52 +0000
@@ -1051,8 +1051,8 @@
          tree.branch.push(remote_branch)
          autopack_calls = len([call for call in self.hpss_calls if call ==
              'PackRepository.autopack'])
--        streaming_calls = len([call for call in self.hpss_calls if call ==
--            'Repository.insert_stream'])
++        streaming_calls = len([call for call in self.hpss_calls if call in
++            ('Repository.insert_stream', 'Repository.insert_stream_1.18')])
          if autopack_calls:
              # Non streaming server
              self.assertEqual(1, autopack_calls)
 === modified file 'bzrlib/tests/test_inventory_delta.py'
 --- bzrlib/tests/test_inventory_delta.py	2009-08-05 02:30:11 +0000
 +++ bzrlib/tests/test_inventory_delta.py	2009-08-11 06:52:07 +0000
@@ -26,6 +26,7 @@
      inventory,
      inventory_delta,
+     )
++from bzrlib.inventory_delta import InventoryDeltaError
  from bzrlib.inventory import Inventory
  from bzrlib.revision import NULL_REVISION
  from bzrlib.tests import TestCase
@@ -93,34 +94,34 @@
      """Test InventoryDeltaSerializer.parse_text_bytes."""
      def test_parse_no_bytes(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          err = self.assertRaises(
--            errors.BzrError, serializer.parse_text_bytes, '')
++            InventoryDeltaError, deserializer.parse_text_bytes, '')
          self.assertContainsRe(str(err), 'last line not empty')
      def test_parse_bad_format(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, 'format: foo\n')
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, 'format: foo\n')
          self.assertContainsRe(str(err), 'unknown format')
      def test_parse_no_parent(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes,
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes,
              'format: bzr inventory delta v1 (bzr 1.14)\n')
          self.assertContainsRe(str(err), 'missing parent: marker')
      def test_parse_no_version(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes,
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes,
              'format: bzr inventory delta v1 (bzr 1.14)\n'
              'parent: null:\n')
          self.assertContainsRe(str(err), 'missing version: marker')
      def test_parse_duplicate_key_errors(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          double_root_lines = \
  """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
@@ -130,14 +131,13 @@
  None\x00/\x00an-id\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00
  None\x00/\x00an-id\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, double_root_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, double_root_lines)
          self.assertContainsRe(str(err), 'duplicate file id')
      def test_parse_versioned_root_only(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
--        parse_result = serializer.parse_text_bytes(root_only_lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(root_only_lines)
          expected_entry = inventory.make_entry(
              'directory', u'', None, 'an-id')
          expected_entry.revision = 'a@e\xc3\xa5ample.com--2004'
@@ -147,8 +147,7 @@
              parse_result)
      def test_parse_special_revid_not_valid_last_mod(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=False, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          root_only_lines = """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
  version: null:
@@ -156,13 +155,12 @@
  tree_references: true
  None\x00/\x00TREE_ROOT\x00\x00null:\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, root_only_lines)
          self.assertContainsRe(str(err), 'special revisionid found')
      def test_parse_versioned_root_versioned_disabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=False, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          root_only_lines = """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
  version: null:
@@ -170,13 +168,12 @@
  tree_references: true
  None\x00/\x00TREE_ROOT\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, root_only_lines)
          self.assertContainsRe(str(err), 'Versioned root found')
      def test_parse_unique_root_id_root_versioned_disabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=False, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          root_only_lines = """format: bzr inventory delta v1 (bzr 1.14)
  parent: parent-id
  version: a@e\xc3\xa5ample.com--2004
@@ -184,29 +181,38 @@
  tree_references: true
  None\x00/\x00an-id\x00\x00parent-id\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, root_only_lines)
          self.assertContainsRe(str(err), 'Versioned root found')
      def test_parse_unversioned_root_versioning_enabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_unversioned)
--        self.assertContainsRe(
--            str(err), 'serialized versioned_root flag is wrong: False')
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(root_only_unversioned)
++        expected_entry = inventory.make_entry(
++            'directory', u'', None, 'TREE_ROOT')
++        expected_entry.revision = 'entry-version'
++        self.assertEqual(
++            ('null:', 'entry-version', False, False,
++             [(None, u'', 'TREE_ROOT', expected_entry)]),
++            parse_result)
++
++    def test_parse_versioned_root_when_disabled(self):
++        deserializer = inventory_delta.InventoryDeltaDeserializer(
++            allow_versioned_root=False)
++        err = self.assertRaises(inventory_delta.IncompatibleInventoryDelta,
++            deserializer.parse_text_bytes, root_only_lines)
++        self.assertEquals("versioned_root not allowed", str(err))
      def test_parse_tree_when_disabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=False)
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, reference_lines)
--        self.assertContainsRe(
--            str(err), 'serialized tree_references flag is wrong: True')
++        deserializer = inventory_delta.InventoryDeltaDeserializer(
++            allow_tree_references=False)
++        err = self.assertRaises(inventory_delta.IncompatibleInventoryDelta,
++            deserializer.parse_text_bytes, reference_lines)
++        self.assertEquals("Tree reference not allowed", str(err))
      def test_parse_tree_when_header_disallows(self):
          # A deserializer that allows tree_references to be set or unset.
--        serializer = inventory_delta.InventoryDeltaSerializer()
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          # A serialised inventory delta with a header saying no tree refs, but
          # that has a tree ref in its content.
          lines = """format: bzr inventory delta v1 (bzr 1.14)
@@ -216,13 +222,13 @@
  tree_references: false
  None\x00/foo\x00id\x00TREE_ROOT\x00changed\x00tree\x00subtree-version
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
          self.assertContainsRe(str(err), 'Tree reference found')
      def test_parse_versioned_root_when_header_disallows(self):
          # A deserializer that allows tree_references to be set or unset.
--        serializer = inventory_delta.InventoryDeltaSerializer()
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          # A serialised inventory delta with a header saying no tree refs, but
          # that has a tree ref in its content.
          lines = """format: bzr inventory delta v1 (bzr 1.14)
@@ -232,35 +238,35 @@
  tree_references: false
  None\x00/\x00TREE_ROOT\x00\x00a@e\xc3\xa5ample.com--2004\x00dir
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
          self.assertContainsRe(str(err), 'Versioned root found')
      def test_parse_last_line_not_empty(self):
          """newpath must start with / if it is not None."""
          # Trim the trailing newline from a valid serialization
          lines = root_only_lines[:-1]
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
          self.assertContainsRe(str(err), 'last line not empty')
      def test_parse_invalid_newpath(self):
          """newpath must start with / if it is not None."""
          lines = empty_lines
          lines += "None\x00bad\x00TREE_ROOT\x00\x00version\x00dir\n"
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
          self.assertContainsRe(str(err), 'newpath invalid')
      def test_parse_invalid_oldpath(self):
          """oldpath must start with / if it is not None."""
          lines = root_only_lines
          lines += "bad\x00/new\x00file-id\x00\x00version\x00dir\n"
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
          self.assertContainsRe(str(err), 'oldpath invalid')
      def test_parse_new_file(self):
@@ -270,8 +276,8 @@
          lines += (
              "None\x00/new\x00file-id\x00an-id\x00version\x00file\x00123\x00" +
              "\x00" + fake_sha + "\n")
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        parse_result = serializer.parse_text_bytes(lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(lines)
          expected_entry = inventory.make_entry(
              'file', u'new', 'an-id', 'file-id')
          expected_entry.revision = 'version'
@@ -285,8 +291,8 @@
          lines = root_only_lines
          lines += (
              "/old-file\x00None\x00deleted-id\x00\x00null:\x00deleted\x00\x00\n")
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        parse_result = serializer.parse_text_bytes(lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(lines)
          delta = parse_result[4]
          self.assertEqual(
               (u'old-file', None, 'deleted-id', None), delta[-1])
@@ -299,8 +305,8 @@
          old_inv = Inventory(None)
          new_inv = Inventory(None)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(True, True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          self.assertEqual(StringIO(empty_lines).readlines(),
              serializer.delta_to_lines(NULL_REVISION, NULL_REVISION, delta))
@@ -311,8 +317,8 @@
          root.revision = 'a@e\xc3\xa5ample.com--2004'
          new_inv.add(root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          self.assertEqual(StringIO(root_only_lines).readlines(),
              serializer.delta_to_lines(NULL_REVISION, 'entry-version', delta))
@@ -324,15 +330,16 @@
          root.revision = 'entry-version'
          new_inv.add(root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(False, False)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=False, tree_references=False)
          serialized_lines = serializer.delta_to_lines(
              NULL_REVISION, 'entry-version', delta)
          self.assertEqual(StringIO(root_only_unversioned).readlines(),
              serialized_lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          self.assertEqual(
              (NULL_REVISION, 'entry-version', False, False, delta),
--            serializer.parse_text_bytes(''.join(serialized_lines)))
++            deserializer.parse_text_bytes(''.join(serialized_lines)))
      def test_unversioned_non_root_errors(self):
          old_inv = Inventory(None)
@@ -343,9 +350,9 @@
          non_root = new_inv.make_entry('directory', 'foo', root.file_id, 'id')
          new_inv.add(non_root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
++        err = self.assertRaises(InventoryDeltaError,
              serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
          self.assertEqual(str(err), 'no version for fileid id')
@@ -355,9 +362,9 @@
          root = new_inv.make_entry('directory', '', None, 'TREE_ROOT')
          new_inv.add(root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
++        err = self.assertRaises(InventoryDeltaError,
              serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
          self.assertEqual(str(err), 'no version for fileid TREE_ROOT')
@@ -368,9 +375,9 @@
          root.revision = 'a@e\xc3\xa5ample.com--2004'
          new_inv.add(root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=False, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=False, tree_references=True)
++        err = self.assertRaises(InventoryDeltaError,
              serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
          self.assertStartsWith(str(err), 'Version present for / in TREE_ROOT')
@@ -385,8 +392,8 @@
          non_root.kind = 'strangelove'
          new_inv.add(non_root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(True, True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          # we expect keyerror because there is little value wrapping this.
          # This test aims to prove that it errors more than how it errors.
          err = self.assertRaises(KeyError,
@@ -405,8 +412,8 @@
          non_root.reference_revision = 'subtree-version'
          new_inv.add(non_root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=False)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=False)
          # we expect keyerror because there is little value wrapping this.
          # This test aims to prove that it errors more than how it errors.
          err = self.assertRaises(KeyError,
@@ -425,8 +432,8 @@
          non_root.reference_revision = 'subtree-version'
          new_inv.add(non_root)
          delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          self.assertEqual(StringIO(reference_lines).readlines(),
              serializer.delta_to_lines(NULL_REVISION, 'entry-version', delta))
@@ -434,19 +441,19 @@
          root_entry = inventory.make_entry('directory', '', None, 'TREE_ROOT')
          root_entry.revision = 'some-version'
          delta = [(None, '', 'TREE_ROOT', root_entry)]
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=False, tree_references=True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=False, tree_references=True)
          self.assertRaises(
--            errors.BzrError, serializer.delta_to_lines, 'old-version',
++            InventoryDeltaError, serializer.delta_to_lines, 'old-version',
              'new-version', delta)
      def test_to_inventory_root_id_not_versioned(self):
          delta = [(None, '', 'an-id', inventory.make_entry(
              'directory', '', None, 'an-id'))]
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          self.assertRaises(
--            errors.BzrError, serializer.delta_to_lines, 'old-version',
++            InventoryDeltaError, serializer.delta_to_lines, 'old-version',
              'new-version', delta)
      def test_to_inventory_has_tree_not_meant_to(self):
@@ -459,9 +466,9 @@
              (None, 'foo', 'ref-id', tree_ref)
              # a file that followed the root move
+             ]
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
--        self.assertRaises(errors.BzrError, serializer.delta_to_lines,
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
++        self.assertRaises(InventoryDeltaError, serializer.delta_to_lines,
              'old-version', 'new-version', delta)
      def test_to_inventory_torture(self):
@@ -511,8 +518,8 @@
                     executable=True, text_size=30, text_sha1='some-sha',
                     revision='old-rev')),
+             ]
--        serializer = inventory_delta.InventoryDeltaSerializer()
--        serializer.require_flags(versioned_root=True, tree_references=True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          lines = serializer.delta_to_lines(NULL_REVISION, 'something', delta)
          expected = """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
@@ -565,13 +572,13 @@
      def test_file_without_size(self):
          file_entry = inventory.make_entry('file', 'a file', None, 'file-id')
          file_entry.text_sha1 = 'foo'
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._file_content, file_entry)
      def test_file_without_sha1(self):
          file_entry = inventory.make_entry('file', 'a file', None, 'file-id')
          file_entry.text_size = 10
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._file_content, file_entry)
      def test_link_empty_target(self):
@@ -594,7 +601,7 @@
      def test_link_no_target(self):
          entry = inventory.make_entry('symlink', 'a link', None)
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._link_content, entry)
      def test_reference_null(self):
@@ -611,5 +618,5 @@
      def test_reference_no_reference(self):
          entry = inventory.make_entry('tree-reference', 'a tree', None)
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._reference_content, entry)
 === modified file 'bzrlib/tests/test_remote.py'
 --- bzrlib/tests/test_remote.py	2009-08-13 12:51:59 +0000
 +++ bzrlib/tests/test_remote.py	2009-08-13 03:29:52 +0000
@@ -2213,7 +2213,26 @@
          self.assertEqual([], client._calls)
--class TestRepositoryInsertStream(TestRemoteRepository):
++class TestRepositoryInsertStreamBase(TestRemoteRepository):
++    """Base class for Repository.insert_stream and .insert_stream_1.18
++    tests.
++    """
++
++    def checkInsertEmptyStream(self, repo, client):
++        """Insert an empty stream, checking the result.
++
++        This checks that there are no resume_tokens or missing_keys, and that
++        the client is finished.
++        """
++        sink = repo._get_sink()
++        fmt = repository.RepositoryFormat.get_default_format()
++        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
++        self.assertEqual([], resume_tokens)
++        self.assertEqual(set(), missing_keys)
++        self.assertFinished(client)
++
++
++class TestRepositoryInsertStream(TestRepositoryInsertStreamBase):
      """Tests for using Repository.insert_stream verb when the _1.18 variant is
      not available.
@@ -2236,12 +2255,7 @@
          client.add_expected_call(
              'Repository.insert_stream', ('quack/', ''),
              'success', ('ok',))
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
++        self.checkInsertEmptyStream(repo, client)
      def test_locked_repo_with_no_lock_token(self):
          transport_path = 'quack'
@@ -2259,12 +2273,7 @@
              'Repository.insert_stream', ('quack/', ''),
              'success', ('ok',))
          repo.lock_write()
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
++        self.checkInsertEmptyStream(repo, client)
      def test_locked_repo_with_lock_token(self):
          transport_path = 'quack'
@@ -2282,18 +2291,13 @@
              'Repository.insert_stream_locked', ('quack/', '', 'a token'),
              'success', ('ok',))
          repo.lock_write()
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
++        self.checkInsertEmptyStream(repo, client)
      def test_stream_with_inventory_deltas(self):
--        """'inventory-deltas' substreams can't be sent to the
--        Repository.insert_stream verb.  So when one is encountered the
--        RemoteSink immediately stops using that verb and falls back to VFS
--        insert_stream.
++        """'inventory-deltas' substreams cannot be sent to the
++        Repository.insert_stream verb, because not all servers that implement
++        that verb will accept them.  So when one is encountered the RemoteSink
++        immediately stops using that verb and falls back to VFS insert_stream.
          """
          transport_path = 'quack'
          repo, client = self.setup_fake_client_and_repository(transport_path)
@@ -2368,8 +2372,8 @@
                  'directory', 'newdir', inv.root.file_id, 'newdir-id')
              entry.revision = 'ghost'
              delta = [(None, 'newdir', 'newdir-id', entry)]
--            serializer = inventory_delta.InventoryDeltaSerializer()
--            serializer.require_flags(True, False)
++            serializer = inventory_delta.InventoryDeltaSerializer(
++                versioned_root=True, tree_references=False)
              lines = serializer.delta_to_lines('rev1', 'rev2', delta)
              yield versionedfile.ChunkedContentFactory(
                  ('rev2',), (('rev1',)), None, lines)
@@ -2380,7 +2384,7 @@
          return stream_with_inv_delta()
--class TestRepositoryInsertStream_1_18(TestRemoteRepository):
++class TestRepositoryInsertStream_1_18(TestRepositoryInsertStreamBase):
      def test_unlocked_repo(self):
          transport_path = 'quack'
@@ -2391,12 +2395,7 @@
          client.add_expected_call(
              'Repository.insert_stream_1.18', ('quack/', ''),
              'success', ('ok',))
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
++        self.checkInsertEmptyStream(repo, client)
      def test_locked_repo_with_no_lock_token(self):
          transport_path = 'quack'
@@ -2411,12 +2410,7 @@
              'Repository.insert_stream_1.18', ('quack/', ''),
              'success', ('ok',))
          repo.lock_write()
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
++        self.checkInsertEmptyStream(repo, client)
      def test_locked_repo_with_lock_token(self):
          transport_path = 'quack'
@@ -2431,12 +2425,7 @@
              'Repository.insert_stream_1.18', ('quack/', '', 'a token'),
              'success', ('ok',))
          repo.lock_write()
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
++        self.checkInsertEmptyStream(repo, client)
  class TestRepositoryTarball(TestRemoteRepository):
 === modified file 'bzrlib/tests/test_smart.py'
 --- bzrlib/tests/test_smart.py	2009-07-23 07:37:05 +0000
 +++ bzrlib/tests/test_smart.py	2009-08-07 02:26:45 +0000
@@ -1242,6 +1242,7 @@
              SmartServerResponse(('history-incomplete', 2, r2)),
              request.execute('stacked', 1, (3, r3)))
++
  class TestSmartServerRepositoryGetStream(tests.TestCaseWithMemoryTransport):
      def make_two_commit_repo(self):

 === modified file 'NEWS'
 --- NEWS	2009-08-13 22:32:23 +0000
 +++ NEWS	2009-08-14 05:35:32 +0000
@@ -24,6 +24,15 @@
  Bug Fixes
  *********
++* Fetching from 2a branches from a version-2 bzr protocol would fail to
++  copy the internal inventory pages from the CHK store. This cannot happen
++  in normal use as all 2a compatible clients and servers support the
++  version-3 protocol, but it does cause test suite failures when testing
++  downlevel protocol behaviour. (Robert Collins)
++
++* Fixed "Pack ... already exists" error when running ``bzr pack`` on a
++  fully packed 2a repository.  (Andrew Bennetts, #382463)
++
  * Further tweaks to handling of ``bzr add`` messages about ignored files.
    (Jason Spashett, #76616)
@@ -32,15 +41,22 @@
    parent inventories incorrectly, and also not handling when one of the
    parents was a ghost. (John Arbash Meinel, #402778, #412198)
--* Fetching from 2a branches from a version-2 bzr protocol would fail to
--  copy the internal inventory pages from the CHK store. This cannot happen
--  in normal use as all 2a compatible clients and servers support the
--  version-3 protocol, but it does cause test suite failures when testing
--  downlevel protocol behaviour. (Robert Collins)
++* ``RemoteStreamSource.get_stream_for_missing_keys`` will fetch CHK
++  inventory pages when appropriate (by falling back to the vfs stream
++  source).  (Andrew Bennetts, #406686)
++
++* StreamSource generates rich roots from non-rich root sources correctly
++  now.  (Andrew Bennetts, #368921)
  Improvements
  ************
++* Cross-format fetches (such as between 1.9-rich-root and 2a) via the
++  smart server are more efficient now.  They send inventory deltas rather
++  than full inventories.  The smart server has two new requests,
++  ``Repository.get_stream_1.19`` and ``Repository.insert_stream_1.19`` to
++  support this.  (Andrew Bennetts, #374738, #385826)
++
  Documentation
  *************
@@ -50,6 +66,11 @@
  Internals
  *********
++* InterDifferingSerializer is now only used locally.  Other fetches that
++  would have used InterDifferingSerializer now use the more network
++  friendly StreamSource, which now automatically does the same
++  transformations as InterDifferingSerializer.  (Andrew Bennetts)
++
  * RemoteBranch.open now honours ignore_fallbacks correctly on bzr-v2
    protocols. (Robert Collins)
 === modified file 'bzrlib/fetch.py'
 --- bzrlib/fetch.py	2009-07-09 08:59:51 +0000
 +++ bzrlib/fetch.py	2009-08-14 05:35:32 +0000
@@ -25,16 +25,21 @@
  import operator
++from bzrlib.lazy_import import lazy_import
++lazy_import(globals(), """
++from bzrlib import (
++    tsort,
++    versionedfile,
++    )
++""")
  import bzrlib
  from bzrlib import (
      errors,
      symbol_versioning,
+     )
  from bzrlib.revision import NULL_REVISION
--from bzrlib.tsort import topo_sort
  from bzrlib.trace import mutter
  import bzrlib.ui
--from bzrlib.versionedfile import FulltextContentFactory
  class RepoFetcher(object):
@@ -213,11 +218,9 @@
      def _find_root_ids(self, revs, parent_map, graph):
          revision_root = {}
--        planned_versions = {}
          for tree in self.iter_rev_trees(revs):
              revision_id = tree.inventory.root.revision
              root_id = tree.get_root_id()
--            planned_versions.setdefault(root_id, []).append(revision_id)
              revision_root[revision_id] = root_id
          # Find out which parents we don't already know root ids for
          parents = set()
@@ -229,7 +232,7 @@
          for tree in self.iter_rev_trees(parents):
              root_id = tree.get_root_id()
              revision_root[tree.get_revision_id()] = root_id
--        return revision_root, planned_versions
++        return revision_root
      def generate_root_texts(self, revs):
          """Generate VersionedFiles for all root ids.
@@ -238,9 +241,8 @@
          """
          graph = self.source.get_graph()
          parent_map = graph.get_parent_map(revs)
--        rev_order = topo_sort(parent_map)
--        rev_id_to_root_id, root_id_to_rev_ids = self._find_root_ids(
--            revs, parent_map, graph)
++        rev_order = tsort.topo_sort(parent_map)
++        rev_id_to_root_id = self._find_root_ids(revs, parent_map, graph)
          root_id_order = [(rev_id_to_root_id[rev_id], rev_id) for rev_id in
              rev_order]
          # Guaranteed stable, this groups all the file id operations together
@@ -249,20 +251,93 @@
          # yet, and are unlikely to in non-rich-root environments anyway.
          root_id_order.sort(key=operator.itemgetter(0))
          # Create a record stream containing the roots to create.
--        def yield_roots():
--            for key in root_id_order:
--                root_id, rev_id = key
--                rev_parents = parent_map[rev_id]
--                # We drop revision parents with different file-ids, because
--                # that represents a rename of the root to a different location
--                # - its not actually a parent for us. (We could look for that
--                # file id in the revision tree at considerably more expense,
--                # but for now this is sufficient (and reconcile will catch and
--                # correct this anyway).
--                # When a parent revision is a ghost, we guess that its root id
--                # was unchanged (rather than trimming it from the parent list).
--                parent_keys = tuple((root_id, parent) for parent in rev_parents
--                    if parent != NULL_REVISION and
--                        rev_id_to_root_id.get(parent, root_id) == root_id)
--                yield FulltextContentFactory(key, parent_keys, None, '')
--        return [('texts', yield_roots())]
++        from bzrlib.graph import FrozenHeadsCache
++        graph = FrozenHeadsCache(graph)
++        new_roots_stream = _new_root_data_stream(
++            root_id_order, rev_id_to_root_id, parent_map, self.source, graph)
++        return [('texts', new_roots_stream)]
++
++
++def _new_root_data_stream(
++    root_keys_to_create, rev_id_to_root_id_map, parent_map, repo, graph=None):
++    """Generate a texts substream of synthesised root entries.
++
++    Used in fetches that do rich-root upgrades.
++
++    :param root_keys_to_create: iterable of (root_id, rev_id) pairs describing
++        the root entries to create.
++    :param rev_id_to_root_id_map: dict of known rev_id -> root_id mappings for
++        calculating the parents.  If a parent rev_id is not found here then it
++        will be recalculated.
++    :param parent_map: a parent map for all the revisions in
++        root_keys_to_create.
++    :param graph: a graph to use instead of repo.get_graph().
++    """
++    for root_key in root_keys_to_create:
++        root_id, rev_id = root_key
++        parent_keys = _parent_keys_for_root_version(
++            root_id, rev_id, rev_id_to_root_id_map, parent_map, repo, graph)
++        yield versionedfile.FulltextContentFactory(
++            root_key, parent_keys, None, '')
++
++
++def _parent_keys_for_root_version(
++    root_id, rev_id, rev_id_to_root_id_map, parent_map, repo, graph=None):
++    """Get the parent keys for a given root id.
++
++    A helper function for _new_root_data_stream.
++    """
++    # Include direct parents of the revision, but only if they used the same
++    # root_id and are heads.
++    rev_parents = parent_map[rev_id]
++    parent_ids = []
++    for parent_id in rev_parents:
++        if parent_id == NULL_REVISION:
++            continue
++        if parent_id not in rev_id_to_root_id_map:
++            # We probably didn't read this revision, go spend the extra effort
++            # to actually check
++            try:
++                tree = repo.revision_tree(parent_id)
++            except errors.NoSuchRevision:
++                # Ghost, fill out rev_id_to_root_id in case we encounter this
++                # again.
++                # But set parent_root_id to None since we don't really know
++                parent_root_id = None
++            else:
++                parent_root_id = tree.get_root_id()
++            rev_id_to_root_id_map[parent_id] = None
++            # XXX: why not:
++            #   rev_id_to_root_id_map[parent_id] = parent_root_id
++            # memory consumption maybe?
++        else:
++            parent_root_id = rev_id_to_root_id_map[parent_id]
++        if root_id == parent_root_id:
++            # With stacking we _might_ want to refer to a non-local revision,
++            # but this code path only applies when we have the full content
++            # available, so ghosts really are ghosts, not just the edge of
++            # local data.
++            parent_ids.append(parent_id)
++        else:
++            # root_id may be in the parent anyway.
++            try:
++                tree = repo.revision_tree(parent_id)
++            except errors.NoSuchRevision:
++                # ghost, can't refer to it.
++                pass
++            else:
++                try:
++                    parent_ids.append(tree.inventory[root_id].revision)
++                except errors.NoSuchId:
++                    # not in the tree
++                    pass
++    # Drop non-head parents
++    if graph is None:
++        graph = repo.get_graph()
++    heads = graph.heads(parent_ids)
++    selected_ids = []
++    for parent_id in parent_ids:
++        if parent_id in heads and parent_id not in selected_ids:
++            selected_ids.append(parent_id)
++    parent_keys = [(root_id, parent_id) for parent_id in selected_ids]
++    return parent_keys
 === modified file 'bzrlib/help_topics/en/debug-flags.txt'
 --- bzrlib/help_topics/en/debug-flags.txt	2009-08-04 04:36:34 +0000
 +++ bzrlib/help_topics/en/debug-flags.txt	2009-08-14 05:35:32 +0000
@@ -12,6 +12,7 @@
                    operations.
  -Dfetch           Trace history copying between repositories.
  -Dfilters         Emit information for debugging content filtering.
++-Dforceinvdeltas  Force use of inventory deltas during generic streaming fetch.
  -Dgraph           Trace graph traversal.
  -Dhashcache       Log every time a working file is read to determine its hash.
  -Dhooks           Trace hook execution.
@@ -27,3 +28,7 @@
  -Dunlock          Some errors during unlock are treated as warnings.
  -Dpack            Emit information about pack operations.
  -Dsftp            Trace SFTP internals.
++-Dstream          Trace fetch streams.
++-DIDS_never       Never use InterDifferingSerializer when fetching.
++-DIDS_always      Always use InterDifferingSerializer to fetch if appropriate
++                  for the format, even for non-local fetches.
 === modified file 'bzrlib/inventory_delta.py'
 --- bzrlib/inventory_delta.py	2009-04-02 05:53:12 +0000
 +++ bzrlib/inventory_delta.py	2009-08-14 05:35:32 +0000
@@ -29,6 +29,25 @@
  from bzrlib import inventory
  from bzrlib.revision import NULL_REVISION
++FORMAT_1 = 'bzr inventory delta v1 (bzr 1.14)'
++
++
++class InventoryDeltaError(errors.BzrError):
++    """An error when serializing or deserializing an inventory delta."""
++
++    # Most errors when serializing and deserializing are due to bugs, although
++    # damaged input (i.e. a bug in a different process) could cause
++    # deserialization errors too.
++    internal_error = True
++
++
++class IncompatibleInventoryDelta(errors.BzrError):
++    """The delta could not be deserialised because its contents conflict with
++    the allow_versioned_root or allow_tree_references flags of the
++    deserializer.
++    """
++    internal_error = False
++
  def _directory_content(entry):
      """Serialize the content component of entry which is a directory.
@@ -49,7 +68,7 @@
          exec_bytes = ''
      size_exec_sha = (entry.text_size, exec_bytes, entry.text_sha1)
      if None in size_exec_sha:
--        raise errors.BzrError('Missing size or sha for %s' % entry.file_id)
++        raise InventoryDeltaError('Missing size or sha for %s' % entry.file_id)
      return "file\x00%d\x00%s\x00%s" % size_exec_sha
@@ -60,7 +79,7 @@
      """
      target = entry.symlink_target
      if target is None:
--        raise errors.BzrError('Missing target for %s' % entry.file_id)
++        raise InventoryDeltaError('Missing target for %s' % entry.file_id)
      return "link\x00%s" % target.encode('utf8')
@@ -71,7 +90,8 @@
      """
      tree_revision = entry.reference_revision
      if tree_revision is None:
--        raise errors.BzrError('Missing reference revision for %s' % entry.file_id)
++        raise InventoryDeltaError(
++            'Missing reference revision for %s' % entry.file_id)
      return "tree\x00%s" % tree_revision
@@ -115,11 +135,8 @@
      return result
--
  class InventoryDeltaSerializer(object):
--    """Serialize and deserialize inventory deltas."""
--
--    FORMAT_1 = 'bzr inventory delta v1 (bzr 1.14)'
++    """Serialize inventory deltas."""
      def __init__(self, versioned_root, tree_references):
          """Create an InventoryDeltaSerializer.
@@ -141,6 +158,9 @@
      def delta_to_lines(self, old_name, new_name, delta_to_new):
          """Return a line sequence for delta_to_new.
++        Both the versioned_root and tree_references flags must be set via
++        require_flags before calling this.
++
          :param old_name: A UTF8 revision id for the old inventory.  May be
              NULL_REVISION if there is no older inventory and delta_to_new
              includes the entire inventory contents.
@@ -150,15 +170,20 @@
              takes.
          :return: The serialized delta as lines.
          """
++        if type(old_name) is not str:
++            raise TypeError('old_name should be str, got %r' % (old_name,))
++        if type(new_name) is not str:
++            raise TypeError('new_name should be str, got %r' % (new_name,))
          lines = ['', '', '', '', '']
          to_line = self._delta_item_to_line
          for delta_item in delta_to_new:
--            lines.append(to_line(delta_item))
--            if lines[-1].__class__ != str:
--                raise errors.BzrError(
++            line = to_line(delta_item, new_name)
++            if line.__class__ != str:
++                raise InventoryDeltaError(
                      'to_line generated non-str output %r' % lines[-1])
++            lines.append(line)
          lines.sort()
--        lines[0] = "format: %s\n" % InventoryDeltaSerializer.FORMAT_1
++        lines[0] = "format: %s\n" % FORMAT_1
          lines[1] = "parent: %s\n" % old_name
          lines[2] = "version: %s\n" % new_name
          lines[3] = "versioned_root: %s\n" % self._serialize_bool(
@@ -173,7 +198,7 @@
          else:
              return "false"
--    def _delta_item_to_line(self, delta_item):
++    def _delta_item_to_line(self, delta_item, new_version):
          """Convert delta_item to a line."""
          oldpath, newpath, file_id, entry = delta_item
          if newpath is None:
@@ -188,6 +213,10 @@
                  oldpath_utf8 = 'None'
              else:
                  oldpath_utf8 = '/' + oldpath.encode('utf8')
++            if newpath == '/':
++                raise AssertionError(
++                    "Bad inventory delta: '/' is not a valid newpath "
++                    "(should be '') in delta item %r" % (delta_item,))
              # TODO: Test real-world utf8 cache hit rate. It may be a win.
              newpath_utf8 = '/' + newpath.encode('utf8')
              # Serialize None as ''
@@ -196,58 +225,78 @@
              last_modified = entry.revision
              # special cases for /
              if newpath_utf8 == '/' and not self._versioned_root:
--                if file_id != 'TREE_ROOT':
--                    raise errors.BzrError(
--                        'file_id %s is not TREE_ROOT for /' % file_id)
--                if last_modified is not None:
--                    raise errors.BzrError(
--                        'Version present for / in %s' % file_id)
--                last_modified = NULL_REVISION
++                # This is an entry for the root, this inventory does not
++                # support versioned roots.  So this must be an unversioned
++                # root, i.e. last_modified == new revision.  Otherwise, this
++                # delta is invalid.
++                # Note: the non-rich-root repositories *can* have roots with
++                # file-ids other than TREE_ROOT, e.g. repo formats that use the
++                # xml5 serializer.
++                if last_modified != new_version:
++                    raise InventoryDeltaError(
++                        'Version present for / in %s (%s != %s)'
++                        % (file_id, last_modified, new_version))
              if last_modified is None:
--                raise errors.BzrError("no version for fileid %s" % file_id)
++                raise InventoryDeltaError("no version for fileid %s" % file_id)
              content = self._entry_to_content[entry.kind](entry)
          return ("%s\x00%s\x00%s\x00%s\x00%s\x00%s\n" %
              (oldpath_utf8, newpath_utf8, file_id, parent_id, last_modified,
                  content))
++
++class InventoryDeltaDeserializer(object):
++    """Deserialize inventory deltas."""
++
++    def __init__(self, allow_versioned_root=True, allow_tree_references=True):
++        """Create an InventoryDeltaDeserializer.
++
++        :param versioned_root: If True, any root entry that is seen is expected
++            to be versioned, and root entries can have any fileid.
++        :param tree_references: If True support tree-reference entries.
++        """
++        self._allow_versioned_root = allow_versioned_root
++        self._allow_tree_references = allow_tree_references
++
      def _deserialize_bool(self, value):
          if value == "true":
              return True
          elif value == "false":
              return False
          else:
--            raise errors.BzrError("value %r is not a bool" % (value,))
++            raise InventoryDeltaError("value %r is not a bool" % (value,))
      def parse_text_bytes(self, bytes):
          """Parse the text bytes of a serialized inventory delta.
++        If versioned_root and/or tree_references flags were set via
++        require_flags, then the parsed flags must match or a BzrError will be
++        raised.
++
          :param bytes: The bytes to parse. This can be obtained by calling
              delta_to_lines and then doing ''.join(delta_lines).
--        :return: (parent_id, new_id, inventory_delta)
++        :return: (parent_id, new_id, versioned_root, tree_references,
++            inventory_delta)
          """
++        if bytes[-1:] != '\n':
++            last_line = bytes.rsplit('\n', 1)[-1]
++            raise InventoryDeltaError('last line not empty: %r' % (last_line,))
          lines = bytes.split('\n')[:-1] # discard the last empty line
--        if not lines or lines[0] != 'format: %s' % InventoryDeltaSerializer.FORMAT_1:
--            raise errors.BzrError('unknown format %r' % lines[0:1])
++        if not lines or lines[0] != 'format: %s' % FORMAT_1:
++            raise InventoryDeltaError('unknown format %r' % lines[0:1])
          if len(lines) < 2 or not lines[1].startswith('parent: '):
--            raise errors.BzrError('missing parent: marker')
++            raise InventoryDeltaError('missing parent: marker')
          delta_parent_id = lines[1][8:]
          if len(lines) < 3 or not lines[2].startswith('version: '):
--            raise errors.BzrError('missing version: marker')
++            raise InventoryDeltaError('missing version: marker')
          delta_version_id = lines[2][9:]
          if len(lines) < 4 or not lines[3].startswith('versioned_root: '):
--            raise errors.BzrError('missing versioned_root: marker')
++            raise InventoryDeltaError('missing versioned_root: marker')
          delta_versioned_root = self._deserialize_bool(lines[3][16:])
          if len(lines) < 5 or not lines[4].startswith('tree_references: '):
--            raise errors.BzrError('missing tree_references: marker')
++            raise InventoryDeltaError('missing tree_references: marker')
          delta_tree_references = self._deserialize_bool(lines[4][17:])
--        if delta_versioned_root != self._versioned_root:
--            raise errors.BzrError(
--                "serialized versioned_root flag is wrong: %s" %
--                (delta_versioned_root,))
--        if delta_tree_references != self._tree_references:
--            raise errors.BzrError(
--                "serialized tree_references flag is wrong: %s" %
--                (delta_tree_references,))
++        if (not self._allow_versioned_root and delta_versioned_root):
++            raise IncompatibleInventoryDelta("versioned_root not allowed")
          result = []
          seen_ids = set()
          line_iter = iter(lines)
@@ -258,33 +307,58 @@
                  content) = line.split('\x00', 5)
              parent_id = parent_id or None
              if file_id in seen_ids:
--                raise errors.BzrError(
++                raise InventoryDeltaError(
                      "duplicate file id in inventory delta %r" % lines)
              seen_ids.add(file_id)
--            if newpath_utf8 == '/' and not delta_versioned_root and (
--                last_modified != 'null:' or file_id != 'TREE_ROOT'):
--                    raise errors.BzrError("Versioned root found: %r" % line)
--            elif last_modified[-1] == ':':
--                    raise errors.BzrError('special revisionid found: %r' % line)
--            if not delta_tree_references and content.startswith('tree\x00'):
--                raise errors.BzrError("Tree reference found: %r" % line)
--            content_tuple = tuple(content.split('\x00'))
--            entry = _parse_entry(
--                newpath_utf8, file_id, parent_id, last_modified, content_tuple)
++            if (newpath_utf8 == '/' and not delta_versioned_root and
++                last_modified != delta_version_id):
++                    # Delta claims to be not have a versioned root, yet here's
++                    # a root entry with a non-default version.
++                    raise InventoryDeltaError("Versioned root found: %r" % line)
++            elif newpath_utf8 != 'None' and last_modified[-1] == ':':
++                # Deletes have a last_modified of null:, but otherwise special
++                # revision ids should not occur.
++                raise InventoryDeltaError('special revisionid found: %r' % line)
++            if content.startswith('tree\x00'):
++                if delta_tree_references is False:
++                    raise InventoryDeltaError(
++                            "Tree reference found (but header said "
++                            "tree_references: false): %r" % line)
++                elif not self._allow_tree_references:
++                    raise IncompatibleInventoryDelta(
++                        "Tree reference not allowed")
              if oldpath_utf8 == 'None':
                  oldpath = None
++            elif oldpath_utf8[:1] != '/':
++                raise InventoryDeltaError(
++                    "oldpath invalid (does not start with /): %r"
++                    % (oldpath_utf8,))
              else:
++                oldpath_utf8 = oldpath_utf8[1:]
                  oldpath = oldpath_utf8.decode('utf8')
              if newpath_utf8 == 'None':
                  newpath = None
++            elif newpath_utf8[:1] != '/':
++                raise InventoryDeltaError(
++                    "newpath invalid (does not start with /): %r"
++                    % (newpath_utf8,))
              else:
++                # Trim leading slash
++                newpath_utf8 = newpath_utf8[1:]
                  newpath = newpath_utf8.decode('utf8')
++            content_tuple = tuple(content.split('\x00'))
++            if content_tuple[0] == 'deleted':
++                entry = None
++            else:
++                entry = _parse_entry(
++                    newpath, file_id, parent_id, last_modified, content_tuple)
              delta_item = (oldpath, newpath, file_id, entry)
              result.append(delta_item)
--        return delta_parent_id, delta_version_id, result
--
--
--def _parse_entry(utf8_path, file_id, parent_id, last_modified, content):
++        return (delta_parent_id, delta_version_id, delta_versioned_root,
++                delta_tree_references, result)
++
++
++def _parse_entry(path, file_id, parent_id, last_modified, content):
      entry_factory = {
          'dir': _dir_to_entry,
          'file': _file_to_entry,
@@ -292,8 +366,10 @@
          'tree': _tree_to_entry,
+     }
      kind = content[0]
--    path = utf8_path[1:].decode('utf8')
++    if path.startswith('/'):
++        raise AssertionError
      name = basename(path)
      return entry_factory[content[0]](
              content, name, parent_id, file_id, last_modified)
++
 === modified file 'bzrlib/remote.py'
 --- bzrlib/remote.py	2009-08-13 22:32:23 +0000
 +++ bzrlib/remote.py	2009-08-14 05:35:32 +0000
@@ -426,6 +426,7 @@
          self._custom_format = None
          self._network_name = None
          self._creating_bzrdir = None
++        self._supports_chks = None
          self._supports_external_lookups = None
          self._supports_tree_reference = None
          self._rich_root_data = None
@@ -443,6 +444,13 @@
          return self._rich_root_data
      @property
++    def supports_chks(self):
++        if self._supports_chks is None:
++            self._ensure_real()
++            self._supports_chks = self._custom_format.supports_chks
++        return self._supports_chks
++
++    @property
      def supports_external_lookups(self):
          if self._supports_external_lookups is None:
              self._ensure_real()
@@ -1178,9 +1186,9 @@
          self._ensure_real()
          return self._real_repository.get_inventory(revision_id)
--    def iter_inventories(self, revision_ids):
++    def iter_inventories(self, revision_ids, ordering=None):
          self._ensure_real()
--        return self._real_repository.iter_inventories(revision_ids)
++        return self._real_repository.iter_inventories(revision_ids, ordering)
      @needs_read_lock
      def get_revision(self, revision_id):
@@ -1682,43 +1690,61 @@
      def insert_stream(self, stream, src_format, resume_tokens):
          target = self.target_repo
          target._unstacked_provider.missing_keys.clear()
++        candidate_calls = [('Repository.insert_stream_1.19', (1, 19))]
          if target._lock_token:
--            verb = 'Repository.insert_stream_locked'
--            extra_args = (target._lock_token or '',)
--            required_version = (1, 14)
++            candidate_calls.append(('Repository.insert_stream_locked', (1, 14)))
++            lock_args = (target._lock_token or '',)
          else:
--            verb = 'Repository.insert_stream'
--            extra_args = ()
--            required_version = (1, 13)
++            candidate_calls.append(('Repository.insert_stream', (1, 13)))
++            lock_args = ()
          client = target._client
          medium = client._medium
--        if medium._is_remote_before(required_version):
--            # No possible way this can work.
--            return self._insert_real(stream, src_format, resume_tokens)
          path = target.bzrdir._path_for_remote_call(client)
--        if not resume_tokens:
--            # XXX: Ugly but important for correctness, *will* be fixed during
--            # 1.13 cycle. Pushing a stream that is interrupted results in a
--            # fallback to the _real_repositories sink *with a partial stream*.
--            # Thats bad because we insert less data than bzr expected. To avoid
--            # this we do a trial push to make sure the verb is accessible, and
--            # do not fallback when actually pushing the stream. A cleanup patch
--            # is going to look at rewinding/restarting the stream/partial
--            # buffering etc.
++        # Probe for the verb to use with an empty stream before sending the
++        # real stream to it.  We do this both to avoid the risk of sending a
++        # large request that is then rejected, and because we don't want to
++        # implement a way to buffer, rewind, or restart the stream.
++        found_verb = False
++        for verb, required_version in candidate_calls:
++            if medium._is_remote_before(required_version):
++                continue
++            if resume_tokens:
++                # We've already done the probing (and set _is_remote_before) on
++                # a previous insert.
++                found_verb = True
++                break
              byte_stream = smart_repo._stream_to_byte_stream([], src_format)
              try:
                  response = client.call_with_body_stream(
--                    (verb, path, '') + extra_args, byte_stream)
++                    (verb, path, '') + lock_args, byte_stream)
              except errors.UnknownSmartMethod:
                  medium._remember_remote_is_before(required_version)
--                return self._insert_real(stream, src_format, resume_tokens)
++            else:
++                found_verb = True
++                break
++        if not found_verb:
++            # Have to use VFS.
++            return self._insert_real(stream, src_format, resume_tokens)
++        self._last_inv_record = None
++        self._last_substream = None
++        if required_version < (1, 19):
++            # Remote side doesn't support inventory deltas.  Wrap the stream to
++            # make sure we don't send any.  If the stream contains inventory
++            # deltas we'll interrupt the smart insert_stream request and
++            # fallback to VFS.
++            stream = self._stop_stream_if_inventory_delta(stream)
          byte_stream = smart_repo._stream_to_byte_stream(
              stream, src_format)
          resume_tokens = ' '.join(resume_tokens)
          response = client.call_with_body_stream(
--            (verb, path, resume_tokens) + extra_args, byte_stream)
++            (verb, path, resume_tokens) + lock_args, byte_stream)
          if response[0][0] not in ('ok', 'missing-basis'):
              raise errors.UnexpectedSmartServerResponse(response)
++        if self._last_substream is not None:
++            # The stream included an inventory-delta record, but the remote
++            # side isn't new enough to support them.  So we need to send the
++            # rest of the stream via VFS.
++            return self._resume_stream_with_vfs(response, src_format)
          if response[0][0] == 'missing-basis':
              tokens, missing_keys = bencode.bdecode_as_tuple(response[0][1])
              resume_tokens = tokens
@@ -1727,6 +1753,46 @@
              self.target_repo.refresh_data()
              return [], set()
++    def _resume_stream_with_vfs(self, response, src_format):
++        """Resume sending a stream via VFS, first resending the record and
++        substream that couldn't be sent via an insert_stream verb.
++        """
++        if response[0][0] == 'missing-basis':
++            tokens, missing_keys = bencode.bdecode_as_tuple(response[0][1])
++            # Ignore missing_keys, we haven't finished inserting yet
++        else:
++            tokens = []
++        def resume_substream():
++            # Yield the substream that was interrupted.
++            for record in self._last_substream:
++                yield record
++            self._last_substream = None
++        def resume_stream():
++            # Finish sending the interrupted substream
++            yield ('inventory-deltas', resume_substream())
++            # Then simply continue sending the rest of the stream.
++            for substream_kind, substream in self._last_stream:
++                yield substream_kind, substream
++        return self._insert_real(resume_stream(), src_format, tokens)
++
++    def _stop_stream_if_inventory_delta(self, stream):
++        """Normally this just lets the original stream pass-through unchanged.
++
++        However if any 'inventory-deltas' substream occurs it will stop
++        streaming, and store the interrupted substream and stream in
++        self._last_substream and self._last_stream so that the stream can be
++        resumed by _resume_stream_with_vfs.
++        """
++
++        stream_iter = iter(stream)
++        for substream_kind, substream in stream_iter:
++            if substream_kind == 'inventory-deltas':
++                self._last_substream = substream
++                self._last_stream = stream_iter
++                return
++            else:
++                yield substream_kind, substream
++
  class RemoteStreamSource(repository.StreamSource):
      """Stream data from a remote server."""
@@ -1747,6 +1813,12 @@
              sources.append(repo)
          return self.missing_parents_chain(search, sources)
++    def get_stream_for_missing_keys(self, missing_keys):
++        self.from_repository._ensure_real()
++        real_repo = self.from_repository._real_repository
++        real_source = real_repo._get_source(self.to_format)
++        return real_source.get_stream_for_missing_keys(missing_keys)
++
      def _real_stream(self, repo, search):
          """Get a stream for search from repo.
@@ -1782,18 +1854,26 @@
              return self._real_stream(repo, search)
          client = repo._client
          medium = client._medium
--        if medium._is_remote_before((1, 13)):
--            # streaming was added in 1.13
--            return self._real_stream(repo, search)
          path = repo.bzrdir._path_for_remote_call(client)
--        try:
--            search_bytes = repo._serialise_search_result(search)
--            response = repo._call_with_body_bytes_expecting_body(
--                'Repository.get_stream',
--                (path, self.to_format.network_name()), search_bytes)
--            response_tuple, response_handler = response
--        except errors.UnknownSmartMethod:
--            medium._remember_remote_is_before((1,13))
++        search_bytes = repo._serialise_search_result(search)
++        args = (path, self.to_format.network_name())
++        candidate_verbs = [
++            ('Repository.get_stream_1.19', (1, 19)),
++            ('Repository.get_stream', (1, 13))]
++        found_verb = False
++        for verb, version in candidate_verbs:
++            if medium._is_remote_before(version):
++                continue
++            try:
++                response = repo._call_with_body_bytes_expecting_body(
++                    verb, args, search_bytes)
++            except errors.UnknownSmartMethod:
++                medium._remember_remote_is_before(version)
++            else:
++                response_tuple, response_handler = response
++                found_verb = True
++                break
++        if not found_verb:
              return self._real_stream(repo, search)
          if response_tuple[0] != 'ok':
              raise errors.UnexpectedSmartServerResponse(response_tuple)
 === modified file 'bzrlib/repofmt/groupcompress_repo.py'
 --- bzrlib/repofmt/groupcompress_repo.py	2009-08-12 18:58:05 +0000
 +++ bzrlib/repofmt/groupcompress_repo.py	2009-08-14 05:35:32 +0000
@@ -154,6 +154,8 @@
          self._writer.begin()
          # what state is the pack in? (open, finished, aborted)
          self._state = 'open'
++        # no name until we finish writing the content
++        self.name = None
      def _check_references(self):
          """Make sure our external references are present.
@@ -477,6 +479,15 @@
          if not self._use_pack(self.new_pack):
              self.new_pack.abort()
              return None
++        self.new_pack.finish_content()
++        if len(self.packs) == 1:
++            old_pack = self.packs[0]
++            if old_pack.name == self.new_pack._hash.hexdigest():
++                # The single old pack was already optimally packed.
++                trace.mutter('single pack %s was already optimally packed',
++                    old_pack.name)
++                self.new_pack.abort()
++                return None
          self.pb.update('finishing repack', 6, 7)
          self.new_pack.finish()
          self._pack_collection.allocate(self.new_pack)
@@ -591,7 +602,7 @@
              packer = GCCHKPacker(self, packs, '.autopack',
                                   reload_func=reload_func)
              try:
--                packer.pack()
++                result = packer.pack()
              except errors.RetryWithNewPacks:
                  # An exception is propagating out of this context, make sure
                  # this packer has cleaned up. Packer() doesn't set its new_pack
@@ -600,6 +611,8 @@
                  if packer.new_pack is not None:
                      packer.new_pack.abort()
                  raise
++            if result is None:
++                return
              for pack in packs:
                  self._remove_pack_from_memory(pack)
          # record the newly available packs and stop advertising the old
@@ -781,10 +794,12 @@
          return inventory.CHKInventory.deserialise(self.chk_bytes, bytes,
              (revision_id,))
--    def _iter_inventories(self, revision_ids):
++    def _iter_inventories(self, revision_ids, ordering):
          """Iterate over many inventory objects."""
++        if ordering is None:
++            ordering = 'unordered'
          keys = [(revision_id,) for revision_id in revision_ids]
--        stream = self.inventories.get_record_stream(keys, 'unordered', True)
++        stream = self.inventories.get_record_stream(keys, ordering, True)
          texts = {}
          for record in stream:
              if record.storage_kind != 'absent':
@@ -794,7 +809,7 @@
          for key in keys:
              yield inventory.CHKInventory.deserialise(self.chk_bytes, texts[key], key)
--    def _iter_inventory_xmls(self, revision_ids):
++    def _iter_inventory_xmls(self, revision_ids, ordering):
          # Without a native 'xml' inventory, this method doesn't make sense, so
          # make it raise to trap naughty direct users.
          raise NotImplementedError(self._iter_inventory_xmls)
@@ -894,14 +909,11 @@
      def _get_source(self, to_format):
          """Return a source for streaming from this repository."""
--        if isinstance(to_format, remote.RemoteRepositoryFormat):
--            # Can't just check attributes on to_format with the current code,
--            # work around this:
--            to_format._ensure_real()
--            to_format = to_format._custom_format
--        if to_format.__class__ is self._format.__class__:
++        if self._format._serializer == to_format._serializer:
              # We must be exactly the same format, otherwise stuff like the chk
--            # page layout might be different
++            # page layout might be different.
++            # Actually, this test is just slightly looser than exact so that
++            # CHK2 <-> 2a transfers will work.
              return GroupCHKStreamSource(self, to_format)
          return super(CHKInventoryRepository, self)._get_source(to_format)
 === modified file 'bzrlib/repofmt/pack_repo.py'
 --- bzrlib/repofmt/pack_repo.py	2009-08-05 01:05:58 +0000
 +++ bzrlib/repofmt/pack_repo.py	2009-08-14 05:35:32 +0000
@@ -422,6 +422,8 @@
          self._writer.begin()
          # what state is the pack in? (open, finished, aborted)
          self._state = 'open'
++        # no name until we finish writing the content
++        self.name = None
      def abort(self):
          """Cancel creating this pack."""
@@ -448,6 +450,14 @@
              self.signature_index.key_count() or
              (self.chk_index is not None and self.chk_index.key_count()))
++    def finish_content(self):
++        if self.name is not None:
++            return
++        self._writer.end()
++        if self._buffer[1]:
++            self._write_data('', flush=True)
++        self.name = self._hash.hexdigest()
++
      def finish(self, suspend=False):
          """Finish the new pack.
@@ -459,10 +469,7 @@
           - stores the index size tuple for the pack in the index_sizes
             attribute.
          """
--        self._writer.end()
--        if self._buffer[1]:
--            self._write_data('', flush=True)
--        self.name = self._hash.hexdigest()
++        self.finish_content()
          if not suspend:
              self._check_references()
          # write indices
@@ -1567,7 +1574,9 @@
          # determine which packs need changing
          pack_operations = [[0, []]]
          for pack in self.all_packs():
--            if not hint or pack.name in hint:
++            if hint is None or pack.name in hint:
++                # Either no hint was provided (so we are packing everything),
++                # or this pack was included in the hint.
                  pack_operations[-1][0] += pack.get_revision_count()
                  pack_operations[-1][1].append(pack)
          self._execute_pack_operations(pack_operations, OptimisingPacker)
@@ -2093,6 +2102,7 @@
                  # when autopack takes no steps, the names list is still
                  # unsaved.
                  return self._save_pack_names()
++        return []
      def _suspend_write_group(self):
          tokens = [pack.name for pack in self._resumed_packs]
 === modified file 'bzrlib/repository.py'
 --- bzrlib/repository.py	2009-08-12 22:28:28 +0000
 +++ bzrlib/repository.py	2009-08-14 05:35:32 +0000
@@ -31,6 +31,7 @@
      gpg,
      graph,
      inventory,
++    inventory_delta,
      lazy_regex,
      lockable_files,
      lockdir,
@@ -924,6 +925,11 @@
          """
          if self._write_group is not self.get_transaction():
              # has an unlock or relock occured ?
++            if suppress_errors:
++                mutter(
++                '(suppressed) mismatched lock context and write group. %r, %r',
++                self._write_group, self.get_transaction())
++                return
              raise errors.BzrError(
                  'mismatched lock context and write group. %r, %r' %
                  (self._write_group, self.get_transaction()))
@@ -1063,8 +1069,10 @@
          check_content=True):
          """Store lines in inv_vf and return the sha1 of the inventory."""
          parents = [(parent,) for parent in parents]
--        return self.inventories.add_lines((revision_id,), parents, lines,
++        result = self.inventories.add_lines((revision_id,), parents, lines,
              check_content=check_content)[0]
++        self.inventories._access.flush()
++        return result
      def add_revision(self, revision_id, rev, inv=None, config=None):
          """Add rev to the revision store as revision_id.
@@ -1529,6 +1537,8 @@
          """Commit the contents accrued within the current write group.
          :seealso: start_write_group.
++
++        :return: it may return an opaque hint that can be passed to 'pack'.
          """
          if self._write_group is not self.get_transaction():
              # has an unlock or relock occured ?
@@ -2340,7 +2350,7 @@
          """Get Inventory object by revision id."""
          return self.iter_inventories([revision_id]).next()
--    def iter_inventories(self, revision_ids):
++    def iter_inventories(self, revision_ids, ordering=None):
          """Get many inventories by revision_ids.
          This will buffer some or all of the texts used in constructing the
@@ -2348,30 +2358,57 @@
          time.
          :param revision_ids: The expected revision ids of the inventories.
++        :param ordering: optional ordering, e.g. 'topological'.  If not
++            specified, the order of revision_ids will be preserved (by
++            buffering if necessary).
          :return: An iterator of inventories.
          """
          if ((None in revision_ids)
              or (_mod_revision.NULL_REVISION in revision_ids)):
              raise ValueError('cannot get null revision inventory')
--        return self._iter_inventories(revision_ids)
++        return self._iter_inventories(revision_ids, ordering)
--    def _iter_inventories(self, revision_ids):
++    def _iter_inventories(self, revision_ids, ordering):
          """single-document based inventory iteration."""
--        for text, revision_id in self._iter_inventory_xmls(revision_ids):
++        inv_xmls = self._iter_inventory_xmls(revision_ids, ordering)
++        for text, revision_id in inv_xmls:
              yield self.deserialise_inventory(revision_id, text)
--    def _iter_inventory_xmls(self, revision_ids):
++    def _iter_inventory_xmls(self, revision_ids, ordering):
++        if ordering is None:
++            order_as_requested = True
++            ordering = 'unordered'
++        else:
++            order_as_requested = False
          keys = [(revision_id,) for revision_id in revision_ids]
--        stream = self.inventories.get_record_stream(keys, 'unordered', True)
++        if not keys:
++            return
++        if order_as_requested:
++            key_iter = iter(keys)
++            next_key = key_iter.next()
++        stream = self.inventories.get_record_stream(keys, ordering, True)
          text_chunks = {}
          for record in stream:
              if record.storage_kind != 'absent':
--                text_chunks[record.key] = record.get_bytes_as('chunked')
++                chunks = record.get_bytes_as('chunked')
++                if order_as_requested:
++                    text_chunks[record.key] = chunks
++                else:
++                    yield ''.join(chunks), record.key[-1]
              else:
                  raise errors.NoSuchRevision(self, record.key)
--        for key in keys:
--            chunks = text_chunks.pop(key)
--            yield ''.join(chunks), key[-1]
++            if order_as_requested:
++                # Yield as many results as we can while preserving order.
++                while next_key in text_chunks:
++                    chunks = text_chunks.pop(next_key)
++                    yield ''.join(chunks), next_key[-1]
++                    try:
++                        next_key = key_iter.next()
++                    except StopIteration:
++                        # We still want to fully consume the get_record_stream,
++                        # just in case it is not actually finished at this point
++                        next_key = None
++                        break
      def deserialise_inventory(self, revision_id, xml):
          """Transform the xml into an inventory object.
@@ -2398,7 +2435,7 @@
      @needs_read_lock
      def get_inventory_xml(self, revision_id):
          """Get inventory XML as a file object."""
--        texts = self._iter_inventory_xmls([revision_id])
++        texts = self._iter_inventory_xmls([revision_id], 'unordered')
          try:
              text, revision_id = texts.next()
          except StopIteration:
@@ -3664,11 +3701,25 @@
          # This is redundant with format.check_conversion_target(), however that
          # raises an exception, and we just want to say "False" as in we won't
          # support converting between these formats.
++        if 'IDS_never' in debug.debug_flags:
++            return False
          if source.supports_rich_root() and not target.supports_rich_root():
              return False
          if (source._format.supports_tree_reference
              and not target._format.supports_tree_reference):
              return False
++        if target._fallback_repositories and target._format.supports_chks:
++            # IDS doesn't know how to copy CHKs for the parent inventories it
++            # adds to stacked repos.
++            return False
++        if 'IDS_always' in debug.debug_flags:
++            return True
++        # Only use this code path for local source and target.  IDS does far
++        # too much IO (both bandwidth and roundtrips) over a network.
++        if not source.bzrdir.transport.base.startswith('file:///'):
++            return False
++        if not target.bzrdir.transport.base.startswith('file:///'):
++            return False
          return True
      def _get_delta_for_revision(self, tree, parent_ids, basis_id, cache):
@@ -3690,63 +3741,6 @@
          deltas.sort()
          return deltas[0][1:]
--    def _get_parent_keys(self, root_key, parent_map):
--        """Get the parent keys for a given root id."""
--        root_id, rev_id = root_key
--        # Include direct parents of the revision, but only if they used
--        # the same root_id and are heads.
--        parent_keys = []
--        for parent_id in parent_map[rev_id]:
--            if parent_id == _mod_revision.NULL_REVISION:
--                continue
--            if parent_id not in self._revision_id_to_root_id:
--                # We probably didn't read this revision, go spend the
--                # extra effort to actually check
--                try:
--                    tree = self.source.revision_tree(parent_id)
--                except errors.NoSuchRevision:
--                    # Ghost, fill out _revision_id_to_root_id in case we
--                    # encounter this again.
--                    # But set parent_root_id to None since we don't really know
--                    parent_root_id = None
--                else:
--                    parent_root_id = tree.get_root_id()
--                self._revision_id_to_root_id[parent_id] = None
--            else:
--                parent_root_id = self._revision_id_to_root_id[parent_id]
--            if root_id == parent_root_id:
--                # With stacking we _might_ want to refer to a non-local
--                # revision, but this code path only applies when we have the
--                # full content available, so ghosts really are ghosts, not just
--                # the edge of local data.
--                parent_keys.append((parent_id,))
--            else:
--                # root_id may be in the parent anyway.
--                try:
--                    tree = self.source.revision_tree(parent_id)
--                except errors.NoSuchRevision:
--                    # ghost, can't refer to it.
--                    pass
--                else:
--                    try:
--                        parent_keys.append((tree.inventory[root_id].revision,))
--                    except errors.NoSuchId:
--                        # not in the tree
--                        pass
--        g = graph.Graph(self.source.revisions)
--        heads = g.heads(parent_keys)
--        selected_keys = []
--        for key in parent_keys:
--            if key in heads and key not in selected_keys:
--                selected_keys.append(key)
--        return tuple([(root_id,)+ key for key in selected_keys])
--
--    def _new_root_data_stream(self, root_keys_to_create, parent_map):
--        for root_key in root_keys_to_create:
--            parent_keys = self._get_parent_keys(root_key, parent_map)
--            yield versionedfile.FulltextContentFactory(root_key,
--                parent_keys, None, '')
--
      def _fetch_batch(self, revision_ids, basis_id, cache):
          """Fetch across a few revisions.
@@ -3798,8 +3792,10 @@
          from_texts = self.source.texts
          to_texts = self.target.texts
          if root_keys_to_create:
--            root_stream = self._new_root_data_stream(root_keys_to_create,
--                                                     parent_map)
++            from bzrlib.fetch import _new_root_data_stream
++            root_stream = _new_root_data_stream(
++                root_keys_to_create, self._revision_id_to_root_id, parent_map,
++                self.source)
              to_texts.insert_record_stream(root_stream)
          to_texts.insert_record_stream(from_texts.get_record_stream(
              text_keys, self.target._format._fetch_order,
@@ -3899,7 +3895,6 @@
          # Walk though all revisions; get inventory deltas, copy referenced
          # texts that delta references, insert the delta, revision and
          # signature.
--        first_rev = self.source.get_revision(revision_ids[0])
          if pb is None:
              my_pb = ui.ui_factory.nested_progress_bar()
              pb = my_pb
@@ -4071,9 +4066,6 @@
          self.file_ids = set([file_id for file_id, _ in
              self.text_index.iterkeys()])
          # text keys is now grouped by file_id
--        n_weaves = len(self.file_ids)
--        files_in_revisions = {}
--        revisions_of_files = {}
          n_versions = len(self.text_index)
          progress_bar.update('loading text store', 0, n_versions)
          parent_map = self.repository.texts.get_parent_map(self.text_index)
@@ -4172,6 +4164,8 @@
              else:
                  new_pack.set_write_cache_size(1024*1024)
          for substream_type, substream in stream:
++            if 'stream' in debug.debug_flags:
++                mutter('inserting substream: %s', substream_type)
              if substream_type == 'texts':
                  self.target_repo.texts.insert_record_stream(substream)
              elif substream_type == 'inventories':
@@ -4181,6 +4175,9 @@
                  else:
                      self._extract_and_insert_inventories(
                          substream, src_serializer)
++            elif substream_type == 'inventory-deltas':
++                self._extract_and_insert_inventory_deltas(
++                    substream, src_serializer)
              elif substream_type == 'chk_bytes':
                  # XXX: This doesn't support conversions, as it assumes the
                  #      conversion was done in the fetch code.
@@ -4237,18 +4234,45 @@
              self.target_repo.pack(hint=hint)
          return [], set()
--    def _extract_and_insert_inventories(self, substream, serializer):
++    def _extract_and_insert_inventory_deltas(self, substream, serializer):
++        target_rich_root = self.target_repo._format.rich_root_data
++        target_tree_refs = self.target_repo._format.supports_tree_reference
++        for record in substream:
++            # Insert the delta directly
++            inventory_delta_bytes = record.get_bytes_as('fulltext')
++            deserialiser = inventory_delta.InventoryDeltaDeserializer()
++            try:
++                parse_result = deserialiser.parse_text_bytes(
++                    inventory_delta_bytes)
++            except inventory_delta.IncompatibleInventoryDelta, err:
++                trace.mutter("Incompatible delta: %s", err.msg)
++                raise errors.IncompatibleRevision(self.target_repo._format)
++            basis_id, new_id, rich_root, tree_refs, inv_delta = parse_result
++            revision_id = new_id
++            parents = [key[0] for key in record.parents]
++            self.target_repo.add_inventory_by_delta(
++                basis_id, inv_delta, revision_id, parents)
++
++    def _extract_and_insert_inventories(self, substream, serializer,
++            parse_delta=None):
          """Generate a new inventory versionedfile in target, converting data.
          The inventory is retrieved from the source, (deserializing it), and
          stored in the target (reserializing it in a different format).
          """
++        target_rich_root = self.target_repo._format.rich_root_data
++        target_tree_refs = self.target_repo._format.supports_tree_reference
          for record in substream:
++            # It's not a delta, so it must be a fulltext in the source
++            # serializer's format.
              bytes = record.get_bytes_as('fulltext')
              revision_id = record.key[0]
              inv = serializer.read_inventory_from_string(bytes, revision_id)
              parents = [key[0] for key in record.parents]
              self.target_repo.add_inventory(revision_id, inv, parents)
++            # No need to keep holding this full inv in memory when the rest of
++            # the substream is likely to be all deltas.
++            del inv
      def _extract_and_insert_revisions(self, substream, serializer):
          for record in substream:
@@ -4303,11 +4327,8 @@
          return [('signatures', signatures), ('revisions', revisions)]
      def _generate_root_texts(self, revs):
--        """This will be called by __fetch between fetching weave texts and
++        """This will be called by get_stream between fetching weave texts and
          fetching the inventory weave.
--
--        Subclasses should override this if they need to generate root texts
--        after fetching weave texts.
          """
          if self._rich_root_upgrade():
              import bzrlib.fetch
@@ -4345,9 +4366,6 @@
                  # will be valid.
                  for _ in self._generate_root_texts(revs):
                      yield _
--                # NB: This currently reopens the inventory weave in source;
--                # using a single stream interface instead would avoid this.
--                from_weave = self.from_repository.inventories
                  # we fetch only the referenced inventories because we do not
                  # know for unselected inventories whether all their required
                  # texts are present in the other repository - it could be
@@ -4392,6 +4410,18 @@
              if not keys:
                  # No need to stream something we don't have
                  continue
++            if substream_kind == 'inventories':
++                # Some missing keys are genuinely ghosts, filter those out.
++                present = self.from_repository.inventories.get_parent_map(keys)
++                revs = [key[0] for key in present]
++                # Get the inventory stream more-or-less as we do for the
++                # original stream; there's no reason to assume that records
++                # direct from the source will be suitable for the sink.  (Think
++                # e.g. 2a -> 1.9-rich-root).
++                for info in self._get_inventory_stream(revs, missing=True):
++                    yield info
++                continue
++
              # Ask for full texts always so that we don't need more round trips
              # after this stream.
              # Some of the missing keys are genuinely ghosts, so filter absent
@@ -4412,129 +4442,116 @@
          return (not self.from_repository._format.rich_root_data and
              self.to_format.rich_root_data)
--    def _get_inventory_stream(self, revision_ids):
++    def _get_inventory_stream(self, revision_ids, missing=False):
          from_format = self.from_repository._format
--        if (from_format.supports_chks and self.to_format.supports_chks
--            and (from_format._serializer == self.to_format._serializer)):
--            # Both sides support chks, and they use the same serializer, so it
--            # is safe to transmit the chk pages and inventory pages across
--            # as-is.
--            return self._get_chk_inventory_stream(revision_ids)
--        elif (not from_format.supports_chks):
--            # Source repository doesn't support chks. So we can transmit the
--            # inventories 'as-is' and either they are just accepted on the
--            # target, or the Sink will properly convert it.
--            return self._get_simple_inventory_stream(revision_ids)
++        if (from_format.supports_chks and self.to_format.supports_chks and
++            from_format.network_name() == self.to_format.network_name()):
++            raise AssertionError(
++                "this case should be handled by GroupCHKStreamSource")
++        elif 'forceinvdeltas' in debug.debug_flags:
++            return self._get_convertable_inventory_stream(revision_ids,
++                    delta_versus_null=missing)
++        elif from_format.network_name() == self.to_format.network_name():
++            # Same format.
++            return self._get_simple_inventory_stream(revision_ids,
++                    missing=missing)
++        elif (not from_format.supports_chks and not self.to_format.supports_chks
++                and from_format._serializer == self.to_format._serializer):
++            # Essentially the same format.
++            return self._get_simple_inventory_stream(revision_ids,
++                    missing=missing)
          else:
--            # XXX: Hack to make not-chk->chk fetch: copy the inventories as
--            #      inventories. Note that this should probably be done somehow
--            #      as part of bzrlib.repository.StreamSink. Except JAM couldn't
--            #      figure out how a non-chk repository could possibly handle
--            #      deserializing an inventory stream from a chk repo, as it
--            #      doesn't have a way to understand individual pages.
--            return self._get_convertable_inventory_stream(revision_ids)
++            # Any time we switch serializations, we want to use an
++            # inventory-delta based approach.
++            return self._get_convertable_inventory_stream(revision_ids,
++                    delta_versus_null=missing)
--    def _get_simple_inventory_stream(self, revision_ids):
++    def _get_simple_inventory_stream(self, revision_ids, missing=False):
++        # NB: This currently reopens the inventory weave in source;
++        # using a single stream interface instead would avoid this.
          from_weave = self.from_repository.inventories
++        if missing:
++            delta_closure = True
++        else:
++            delta_closure = not self.delta_on_metadata()
          yield ('inventories', from_weave.get_record_stream(
              [(rev_id,) for rev_id in revision_ids],
--            self.inventory_fetch_order(),
--            not self.delta_on_metadata()))
--
--    def _get_chk_inventory_stream(self, revision_ids):
--        """Fetch the inventory texts, along with the associated chk maps."""
--        # We want an inventory outside of the search set, so that we can filter
--        # out uninteresting chk pages. For now we use
--        # _find_revision_outside_set, but if we had a Search with cut_revs, we
--        # could use that instead.
--        start_rev_id = self.from_repository._find_revision_outside_set(
--                            revision_ids)
--        start_rev_key = (start_rev_id,)
--        inv_keys_to_fetch = [(rev_id,) for rev_id in revision_ids]
--        if start_rev_id != _mod_revision.NULL_REVISION:
--            inv_keys_to_fetch.append((start_rev_id,))
--        # Any repo that supports chk_bytes must also support out-of-order
--        # insertion. At least, that is how we expect it to work
--        # We use get_record_stream instead of iter_inventories because we want
--        # to be able to insert the stream as well. We could instead fetch
--        # allowing deltas, and then iter_inventories, but we don't know whether
--        # source or target is more 'local' anway.
--        inv_stream = self.from_repository.inventories.get_record_stream(
--            inv_keys_to_fetch, 'unordered',
--            True) # We need them as full-texts so we can find their references
--        uninteresting_chk_roots = set()
--        interesting_chk_roots = set()
--        def filter_inv_stream(inv_stream):
--            for idx, record in enumerate(inv_stream):
--                ### child_pb.update('fetch inv', idx, len(inv_keys_to_fetch))
--                bytes = record.get_bytes_as('fulltext')
--                chk_inv = inventory.CHKInventory.deserialise(
--                    self.from_repository.chk_bytes, bytes, record.key)
--                if record.key == start_rev_key:
--                    uninteresting_chk_roots.add(chk_inv.id_to_entry.key())
--                    p_id_map = chk_inv.parent_id_basename_to_file_id
--                    if p_id_map is not None:
--                        uninteresting_chk_roots.add(p_id_map.key())
--                else:
--                    yield record
--                    interesting_chk_roots.add(chk_inv.id_to_entry.key())
--                    p_id_map = chk_inv.parent_id_basename_to_file_id
--                    if p_id_map is not None:
--                        interesting_chk_roots.add(p_id_map.key())
--        ### pb.update('fetch inventory', 0, 2)
--        yield ('inventories', filter_inv_stream(inv_stream))
--        # Now that we have worked out all of the interesting root nodes, grab
--        # all of the interesting pages and insert them
--        ### pb.update('fetch inventory', 1, 2)
--        interesting = chk_map.iter_interesting_nodes(
--            self.from_repository.chk_bytes, interesting_chk_roots,
--            uninteresting_chk_roots)
--        def to_stream_adapter():
--            """Adapt the iter_interesting_nodes result to a single stream.
--
--            iter_interesting_nodes returns records as it processes them, along
--            with keys. However, we only want to return the records themselves.
--            """
--            for record, items in interesting:
--                if record is not None:
--                    yield record
--        # XXX: We could instead call get_record_stream(records.keys())
--        #      ATM, this will always insert the records as fulltexts, and
--        #      requires that you can hang on to records once you have gone
--        #      on to the next one. Further, it causes the target to
--        #      recompress the data. Testing shows it to be faster than
--        #      requesting the records again, though.
--        yield ('chk_bytes', to_stream_adapter())
--        ### pb.update('fetch inventory', 2, 2)
--
--    def _get_convertable_inventory_stream(self, revision_ids):
--        # XXX: One of source or target is using chks, and they don't have
--        #      compatible serializations. The StreamSink code expects to be
--        #      able to convert on the target, so we need to put
--        #      bytes-on-the-wire that can be converted
--        yield ('inventories', self._stream_invs_as_fulltexts(revision_ids))
--
--    def _stream_invs_as_fulltexts(self, revision_ids):
++            self.inventory_fetch_order(), delta_closure))
++
++    def _get_convertable_inventory_stream(self, revision_ids,
++                                          delta_versus_null=False):
++        # The source is using CHKs, but the target either doesn't or it has a
++        # different serializer.  The StreamSink code expects to be able to
++        # convert on the target, so we need to put bytes-on-the-wire that can
++        # be converted.  That means inventory deltas (if the remote is <1.19,
++        # RemoteStreamSink will fallback to VFS to insert the deltas).
++        yield ('inventory-deltas',
++           self._stream_invs_as_deltas(revision_ids,
++                                       delta_versus_null=delta_versus_null))
++
++    def _stream_invs_as_deltas(self, revision_ids, delta_versus_null=False):
++        """Return a stream of inventory-deltas for the given rev ids.
++
++        :param revision_ids: The list of inventories to transmit
++        :param delta_versus_null: Don't try to find a minimal delta for this
++            entry, instead compute the delta versus the NULL_REVISION. This
++            effectively streams a complete inventory. Used for stuff like
++            filling in missing parents, etc.
++        """
          from_repo = self.from_repository
--        from_serializer = from_repo._format._serializer
          revision_keys = [(rev_id,) for rev_id in revision_ids]
          parent_map = from_repo.inventories.get_parent_map(revision_keys)
--        for inv in self.from_repository.iter_inventories(revision_ids):
--            # XXX: This is a bit hackish, but it works. Basically,
--            #      CHKSerializer 'accidentally' supports
--            #      read/write_inventory_to_string, even though that is never
--            #      the format that is stored on disk. It *does* give us a
--            #      single string representation for an inventory, so live with
--            #      it for now.
--            #      This would be far better if we had a 'serialized inventory
--            #      delta' form. Then we could use 'inventory._make_delta', and
--            #      transmit that. This would both be faster to generate, and
--            #      result in fewer bytes-on-the-wire.
--            as_bytes = from_serializer.write_inventory_to_string(inv)
++        # XXX: possibly repos could implement a more efficient iter_inv_deltas
++        # method...
++        inventories = self.from_repository.iter_inventories(
++            revision_ids, 'topological')
++        format = from_repo._format
++        invs_sent_so_far = set([_mod_revision.NULL_REVISION])
++        inventory_cache = lru_cache.LRUCache(50)
++        null_inventory = from_repo.revision_tree(
++            _mod_revision.NULL_REVISION).inventory
++        # XXX: ideally the rich-root/tree-refs flags would be per-revision, not
++        # per-repo (e.g.  streaming a non-rich-root revision out of a rich-root
++        # repo back into a non-rich-root repo ought to be allowed)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=format.rich_root_data,
++            tree_references=format.supports_tree_reference)
++        for inv in inventories:
              key = (inv.revision_id,)
              parent_keys = parent_map.get(key, ())
++            delta = None
++            if not delta_versus_null and parent_keys:
++                # The caller did not ask for complete inventories and we have
++                # some parents that we can delta against.  Make a delta against
++                # each parent so that we can find the smallest.
++                parent_ids = [parent_key[0] for parent_key in parent_keys]
++                for parent_id in parent_ids:
++                    if parent_id not in invs_sent_so_far:
++                        # We don't know that the remote side has this basis, so
++                        # we can't use it.
++                        continue
++                    if parent_id == _mod_revision.NULL_REVISION:
++                        parent_inv = null_inventory
++                    else:
++                        parent_inv = inventory_cache.get(parent_id, None)
++                        if parent_inv is None:
++                            parent_inv = from_repo.get_inventory(parent_id)
++                    candidate_delta = inv._make_delta(parent_inv)
++                    if (delta is None or
++                        len(delta) > len(candidate_delta)):
++                        delta = candidate_delta
++                        basis_id = parent_id
++            if delta is None:
++                # Either none of the parents ended up being suitable, or we
++                # were asked to delta against NULL
++                basis_id = _mod_revision.NULL_REVISION
++                delta = inv._make_delta(null_inventory)
++            invs_sent_so_far.add(inv.revision_id)
++            inventory_cache[inv.revision_id] = inv
++            delta_serialized = ''.join(
++                serializer.delta_to_lines(basis_id, key[-1], delta))
              yield versionedfile.FulltextContentFactory(
--                key, parent_keys, None, as_bytes)
++                key, parent_keys, None, delta_serialized)
  def _iter_for_revno(repo, partial_history_cache, stop_index=None,
 === modified file 'bzrlib/smart/protocol.py'
 --- bzrlib/smart/protocol.py	2009-07-17 01:48:56 +0000
 +++ bzrlib/smart/protocol.py	2009-08-14 05:35:32 +0000
@@ -1209,6 +1209,8 @@
          except (KeyboardInterrupt, SystemExit):
              raise
          except Exception:
++            mutter('_iter_with_errors caught error')
++            log_exception_quietly()
              yield sys.exc_info(), None
              return
 === modified file 'bzrlib/smart/repository.py'
 --- bzrlib/smart/repository.py	2009-06-16 06:46:32 +0000
 +++ bzrlib/smart/repository.py	2009-08-14 05:35:32 +0000
@@ -30,6 +30,7 @@
      graph,
      osutils,
      pack,
++    versionedfile,
+     )
  from bzrlib.bzrdir import BzrDir
  from bzrlib.smart.request import (
@@ -39,7 +40,10 @@
+     )
  from bzrlib.repository import _strip_NULL_ghosts, network_format_registry
  from bzrlib import revision as _mod_revision
--from bzrlib.versionedfile import NetworkRecordStream, record_to_fulltext_bytes
++from bzrlib.versionedfile import (
++    NetworkRecordStream,
++    record_to_fulltext_bytes,
++    )
  class SmartServerRepositoryRequest(SmartServerRequest):
@@ -414,8 +418,42 @@
              repository.
          """
          self._to_format = network_format_registry.get(to_network_name)
++        if self._should_fake_unknown():
++            return FailedSmartServerResponse(
++                ('UnknownMethod', 'Repository.get_stream'))
          return None # Signal that we want a body.
++    def _should_fake_unknown(self):
++        """Return True if we should return UnknownMethod to the client.
++
++        This is a workaround for bugs in pre-1.19 clients that claim to
++        support receiving streams of CHK repositories.  The pre-1.19 client
++        expects inventory records to be serialized in the format defined by
++        to_network_name, but in pre-1.19 (at least) that format definition
++        tries to use the xml5 serializer, which does not correctly handle
++        rich-roots.  After 1.19 the client can also accept inventory-deltas
++        (which avoids this issue), and those clients will use the
++        Repository.get_stream_1.19 verb instead of this one.
++        So: if this repository is CHK, and the to_format doesn't match,
++        we should just fake an UnknownSmartMethod error so that the client
++        will fallback to VFS, rather than sending it a stream we know it
++        cannot handle.
++        """
++        from_format = self._repository._format
++        to_format = self._to_format
++        if not from_format.supports_chks:
++            # Source not CHK: that's ok
++            return False
++        if (to_format.supports_chks and
++            from_format.repository_class is to_format.repository_class and
++            from_format._serializer == to_format._serializer):
++            # Source is CHK, but target matches: that's ok
++            # (e.g. 2a->2a, or CHK2->2a)
++            return False
++        # Source is CHK, and target is not CHK or incompatible CHK.  We can't
++        # generate a compatible stream.
++        return True
++
      def do_body(self, body_bytes):
          repository = self._repository
          repository.lock_read()
@@ -451,6 +489,13 @@
              repository.unlock()
++class SmartServerRepositoryGetStream_1_19(SmartServerRepositoryGetStream):
++
++    def _should_fake_unknown(self):
++        """Returns False; we don't need to workaround bugs in 1.19+ clients."""
++        return False
++
++
  def _stream_to_byte_stream(stream, src_format):
      """Convert a record stream to a self delimited byte stream."""
      pack_writer = pack.ContainerSerialiser()
@@ -460,6 +505,8 @@
          for record in substream:
              if record.storage_kind in ('chunked', 'fulltext'):
                  serialised = record_to_fulltext_bytes(record)
++            elif record.storage_kind == 'inventory-delta':
++                serialised = record_to_inventory_delta_bytes(record)
              elif record.storage_kind == 'absent':
                  raise ValueError("Absent factory for %s" % (record.key,))
              else:
@@ -650,6 +697,23 @@
              return SuccessfulSmartServerResponse(('ok', ))
++class SmartServerRepositoryInsertStream_1_19(SmartServerRepositoryInsertStreamLocked):
++    """Insert a record stream from a RemoteSink into a repository.
++
++    Same as SmartServerRepositoryInsertStreamLocked, except:
++     - the lock token argument is optional
++     - servers that implement this verb accept 'inventory-delta' records in the
++       stream.
++
++    New in 1.19.
++    """
++
++    def do_repository_request(self, repository, resume_tokens, lock_token=None):
++        """StreamSink.insert_stream for a remote repository."""
++        SmartServerRepositoryInsertStreamLocked.do_repository_request(
++            self, repository, resume_tokens, lock_token)
++
++
  class SmartServerRepositoryInsertStream(SmartServerRepositoryInsertStreamLocked):
      """Insert a record stream from a RemoteSink into an unlocked repository.
 === modified file 'bzrlib/smart/request.py'
 --- bzrlib/smart/request.py	2009-07-27 02:06:05 +0000
 +++ bzrlib/smart/request.py	2009-08-14 05:35:32 +0000
@@ -553,6 +553,8 @@
  request_handlers.register_lazy(
      'Repository.insert_stream', 'bzrlib.smart.repository', 'SmartServerRepositoryInsertStream')
  request_handlers.register_lazy(
++    'Repository.insert_stream_1.19', 'bzrlib.smart.repository', 'SmartServerRepositoryInsertStream_1_19')
++request_handlers.register_lazy(
      'Repository.insert_stream_locked', 'bzrlib.smart.repository', 'SmartServerRepositoryInsertStreamLocked')
  request_handlers.register_lazy(
      'Repository.is_shared', 'bzrlib.smart.repository', 'SmartServerRepositoryIsShared')
@@ -570,6 +572,9 @@
      'Repository.get_stream', 'bzrlib.smart.repository',
      'SmartServerRepositoryGetStream')
  request_handlers.register_lazy(
++    'Repository.get_stream_1.19', 'bzrlib.smart.repository',
++    'SmartServerRepositoryGetStream_1_19')
++request_handlers.register_lazy(
      'Repository.tarball', 'bzrlib.smart.repository',
      'SmartServerRepositoryTarball')
  request_handlers.register_lazy(
 === modified file 'bzrlib/tests/__init__.py'
 --- bzrlib/tests/__init__.py	2009-08-12 18:49:22 +0000
 +++ bzrlib/tests/__init__.py	2009-08-14 05:35:32 +0000
@@ -1938,6 +1938,16 @@
          sio.encoding = output_encoding
          return sio
++    def disable_verb(self, verb):
++        """Disable a smart server verb for one test."""
++        from bzrlib.smart import request
++        request_handlers = request.request_handlers
++        orig_method = request_handlers.get(verb)
++        request_handlers.remove(verb)
++        def restoreVerb():
++            request_handlers.register(verb, orig_method)
++        self.addCleanup(restoreVerb)
++
  class CapturedCall(object):
      """A helper for capturing smart server calls for easy debug analysis."""
 === modified file 'bzrlib/tests/per_branch/test_push.py'
 --- bzrlib/tests/per_branch/test_push.py	2009-08-05 02:30:59 +0000
 +++ bzrlib/tests/per_branch/test_push.py	2009-08-14 05:35:32 +0000
@@ -261,14 +261,15 @@
          self.assertFalse(local.is_locked())
          local.push(remote)
          hpss_call_names = [item.call.method for item in self.hpss_calls]
--        self.assertTrue('Repository.insert_stream' in hpss_call_names)
--        insert_stream_idx = hpss_call_names.index('Repository.insert_stream')
++        self.assertTrue('Repository.insert_stream_1.19' in hpss_call_names)
++        insert_stream_idx = hpss_call_names.index(
++            'Repository.insert_stream_1.19')
          calls_after_insert_stream = hpss_call_names[insert_stream_idx:]
          # After inserting the stream the client has no reason to query the
          # remote graph any further.
          self.assertEqual(
--            ['Repository.insert_stream', 'Repository.insert_stream', 'get',
--             'Branch.set_last_revision_info', 'Branch.unlock'],
++            ['Repository.insert_stream_1.19', 'Repository.insert_stream_1.19',
++             'get', 'Branch.set_last_revision_info', 'Branch.unlock'],
              calls_after_insert_stream)
      def disableOptimisticGetParentMap(self):
 === modified file 'bzrlib/tests/per_interbranch/test_push.py'
 --- bzrlib/tests/per_interbranch/test_push.py	2009-08-05 02:30:59 +0000
 +++ bzrlib/tests/per_interbranch/test_push.py	2009-08-14 05:35:32 +0000
@@ -267,14 +267,15 @@
          self.assertFalse(local.is_locked())
          local.push(remote)
          hpss_call_names = [item.call.method for item in self.hpss_calls]
--        self.assertTrue('Repository.insert_stream' in hpss_call_names)
--        insert_stream_idx = hpss_call_names.index('Repository.insert_stream')
++        self.assertTrue('Repository.insert_stream_1.19' in hpss_call_names)
++        insert_stream_idx = hpss_call_names.index(
++            'Repository.insert_stream_1.19')
          calls_after_insert_stream = hpss_call_names[insert_stream_idx:]
          # After inserting the stream the client has no reason to query the
          # remote graph any further.
          self.assertEqual(
--            ['Repository.insert_stream', 'Repository.insert_stream', 'get',
--             'Branch.set_last_revision_info', 'Branch.unlock'],
++            ['Repository.insert_stream_1.19', 'Repository.insert_stream_1.19',
++             'get', 'Branch.set_last_revision_info', 'Branch.unlock'],
              calls_after_insert_stream)
      def disableOptimisticGetParentMap(self):
 === modified file 'bzrlib/tests/per_interrepository/__init__.py'
 --- bzrlib/tests/per_interrepository/__init__.py	2009-08-11 21:02:46 +0000
 +++ bzrlib/tests/per_interrepository/__init__.py	2009-08-14 05:35:33 +0000
@@ -32,8 +32,6 @@
+     )
  from bzrlib.repository import (
--    InterDifferingSerializer,
--    InterKnitRepo,
      InterRepository,
+     )
  from bzrlib.tests import (
@@ -48,18 +46,16 @@
      """Transform the input formats to a list of scenarios.
      :param formats: A list of tuples:
--        (interrepo_class, repository_format, repository_format_to).
++        (label, repository_format, repository_format_to).
      """
      result = []
--    for interrepo_class, repository_format, repository_format_to in formats:
--        id = '%s,%s,%s' % (interrepo_class.__name__,
--                            repository_format.__class__.__name__,
--                            repository_format_to.__class__.__name__)
++    for label, repository_format, repository_format_to in formats:
++        id = '%s,%s,%s' % (label, repository_format.__class__.__name__,
++                           repository_format_to.__class__.__name__)
          scenario = (id,
              {"transport_server": transport_server,
               "transport_readonly_server": transport_readonly_server,
               "repository_format": repository_format,
--             "interrepo_class": interrepo_class,
               "repository_format_to": repository_format_to,
               })
          result.append(scenario)
@@ -75,6 +71,8 @@
          weaverepo,
+         )
      result = []
++    def add_combo(label, from_format, to_format):
++        result.append((label, from_format, to_format))
      # test the default InterRepository between format 6 and the current
      # default format.
      # XXX: robertc 20060220 reinstate this when there are two supported
@@ -85,40 +83,48 @@
      for optimiser_class in InterRepository._optimisers:
          format_to_test = optimiser_class._get_repo_format_to_test()
          if format_to_test is not None:
--            result.append((optimiser_class,
--                           format_to_test, format_to_test))
++            add_combo(optimiser_class.__name__, format_to_test, format_to_test)
      # if there are specific combinations we want to use, we can add them
      # here. We want to test rich root upgrading.
--    result.append((InterRepository,
--                   weaverepo.RepositoryFormat5(),
--                   knitrepo.RepositoryFormatKnit3()))
--    result.append((InterRepository,
--                   knitrepo.RepositoryFormatKnit1(),
--                   knitrepo.RepositoryFormatKnit3()))
--    result.append((InterRepository,
--                   knitrepo.RepositoryFormatKnit1(),
--                   knitrepo.RepositoryFormatKnit3()))
--    result.append((InterKnitRepo,
--                   knitrepo.RepositoryFormatKnit1(),
--                   pack_repo.RepositoryFormatKnitPack1()))
--    result.append((InterKnitRepo,
--                   pack_repo.RepositoryFormatKnitPack1(),
--                   knitrepo.RepositoryFormatKnit1()))
--    result.append((InterKnitRepo,
--                   knitrepo.RepositoryFormatKnit3(),
--                   pack_repo.RepositoryFormatKnitPack3()))
--    result.append((InterKnitRepo,
--                   pack_repo.RepositoryFormatKnitPack3(),
--                   knitrepo.RepositoryFormatKnit3()))
--    result.append((InterKnitRepo,
--                   pack_repo.RepositoryFormatKnitPack3(),
--                   pack_repo.RepositoryFormatKnitPack4()))
--    result.append((InterDifferingSerializer,
--                   pack_repo.RepositoryFormatKnitPack1(),
--                   pack_repo.RepositoryFormatKnitPack6RichRoot()))
--    result.append((InterDifferingSerializer,
--                   pack_repo.RepositoryFormatKnitPack6RichRoot(),
--                   groupcompress_repo.RepositoryFormat2a()))
++    # XXX: although we attach InterRepository class names to these scenarios,
++    # there's nothing asserting that these labels correspond to what is
++    # actually used.
++    add_combo('InterRepository',
++              weaverepo.RepositoryFormat5(),
++              knitrepo.RepositoryFormatKnit3())
++    add_combo('InterRepository',
++              knitrepo.RepositoryFormatKnit1(),
++              knitrepo.RepositoryFormatKnit3())
++    add_combo('InterKnitRepo',
++              knitrepo.RepositoryFormatKnit1(),
++              pack_repo.RepositoryFormatKnitPack1())
++    add_combo('InterKnitRepo',
++              pack_repo.RepositoryFormatKnitPack1(),
++              knitrepo.RepositoryFormatKnit1())
++    add_combo('InterKnitRepo',
++              knitrepo.RepositoryFormatKnit3(),
++              pack_repo.RepositoryFormatKnitPack3())
++    add_combo('InterKnitRepo',
++              pack_repo.RepositoryFormatKnitPack3(),
++              knitrepo.RepositoryFormatKnit3())
++    add_combo('InterKnitRepo',
++              pack_repo.RepositoryFormatKnitPack3(),
++              pack_repo.RepositoryFormatKnitPack4())
++    add_combo('InterDifferingSerializer',
++              pack_repo.RepositoryFormatKnitPack1(),
++              pack_repo.RepositoryFormatKnitPack6RichRoot())
++    add_combo('InterDifferingSerializer',
++              pack_repo.RepositoryFormatKnitPack6RichRoot(),
++              groupcompress_repo.RepositoryFormat2a())
++    add_combo('InterDifferingSerializer',
++              groupcompress_repo.RepositoryFormat2a(),
++              pack_repo.RepositoryFormatKnitPack6RichRoot())
++    add_combo('InterRepository',
++              groupcompress_repo.RepositoryFormatCHK2(),
++              groupcompress_repo.RepositoryFormat2a())
++    add_combo('InterDifferingSerializer',
++              groupcompress_repo.RepositoryFormatCHK1(),
++              groupcompress_repo.RepositoryFormat2a())
      return result
 === modified file 'bzrlib/tests/per_interrepository/test_fetch.py'
 --- bzrlib/tests/per_interrepository/test_fetch.py	2009-08-12 02:21:06 +0000
 +++ bzrlib/tests/per_interrepository/test_fetch.py	2009-08-14 05:35:33 +0000
@@ -28,6 +28,9 @@
  from bzrlib.errors import (
      NoSuchRevision,
+     )
++from bzrlib.graph import (
++    SearchResult,
++    )
  from bzrlib.revision import (
      NULL_REVISION,
      Revision,
@@ -124,6 +127,15 @@
              to_repo.texts.get_record_stream([('foo', revid)],
              'unordered', True).next().get_bytes_as('fulltext'))
++    def test_fetch_parent_inventories_at_stacking_boundary_smart(self):
++        self.setup_smart_server_with_call_log()
++        self.test_fetch_parent_inventories_at_stacking_boundary()
++
++    def test_fetch_parent_inventories_at_stacking_boundary_smart_old(self):
++        self.setup_smart_server_with_call_log()
++        self.disable_verb('Repository.insert_stream_1.19')
++        self.test_fetch_parent_inventories_at_stacking_boundary()
++
      def test_fetch_parent_inventories_at_stacking_boundary(self):
          """Fetch to a stacked branch copies inventories for parents of
          revisions at the stacking boundary.
@@ -180,6 +192,28 @@
          self.assertEqual(left_tree.inventory, stacked_left_tree.inventory)
          self.assertEqual(right_tree.inventory, stacked_right_tree.inventory)
++        # Finally, it's not enough to see that the basis inventories are
++        # present.  The texts introduced in merge (and only those) should be
++        # present, and also generating a stream should succeed without blowing
++        # up.
++        self.assertTrue(unstacked_repo.has_revision('merge'))
++        expected_texts = set([('file-id', 'merge')])
++        if stacked_branch.repository.texts.get_parent_map([('root-id',
++            'merge')]):
++            # If a (root-id,merge) text exists, it should be in the stacked
++            # repo.
++            expected_texts.add(('root-id', 'merge'))
++        self.assertEqual(expected_texts, unstacked_repo.texts.keys())
++        self.assertCanStreamRevision(unstacked_repo, 'merge')
++
++    def assertCanStreamRevision(self, repo, revision_id):
++        exclude_keys = set(repo.all_revision_ids()) - set([revision_id])
++        search = SearchResult([revision_id], exclude_keys, 1, [revision_id])
++        source = repo._get_source(repo._format)
++        for substream_kind, substream in source.get_stream(search):
++            # Consume the substream
++            list(substream)
++
      def test_fetch_across_stacking_boundary_ignores_ghost(self):
          if not self.repository_format_to.supports_external_lookups:
              raise TestNotApplicable("Need stacking support in the target.")
@@ -218,6 +252,19 @@
          self.addCleanup(stacked_branch.unlock)
          stacked_second_tree = stacked_branch.repository.revision_tree('second')
          self.assertEqual(second_tree.inventory, stacked_second_tree.inventory)
++        # Finally, it's not enough to see that the basis inventories are
++        # present.  The texts introduced in merge (and only those) should be
++        # present, and also generating a stream should succeed without blowing
++        # up.
++        self.assertTrue(unstacked_repo.has_revision('third'))
++        expected_texts = set([('file-id', 'third')])
++        if stacked_branch.repository.texts.get_parent_map([('root-id',
++            'third')]):
++            # If a (root-id,third) text exists, it should be in the stacked
++            # repo.
++            expected_texts.add(('root-id', 'third'))
++        self.assertEqual(expected_texts, unstacked_repo.texts.keys())
++        self.assertCanStreamRevision(unstacked_repo, 'third')
      def test_fetch_missing_basis_text(self):
          """If fetching a delta, we should die if a basis is not present."""
 === modified file 'bzrlib/tests/per_pack_repository.py'
 --- bzrlib/tests/per_pack_repository.py	2009-08-12 22:28:28 +0000
 +++ bzrlib/tests/per_pack_repository.py	2009-08-14 05:35:32 +0000
@@ -1051,8 +1051,8 @@
          tree.branch.push(remote_branch)
          autopack_calls = len([call for call in self.hpss_calls if call ==
              'PackRepository.autopack'])
--        streaming_calls = len([call for call in self.hpss_calls if call ==
--            'Repository.insert_stream'])
++        streaming_calls = len([call for call in self.hpss_calls if call in
++            ('Repository.insert_stream', 'Repository.insert_stream_1.19')])
          if autopack_calls:
              # Non streaming server
              self.assertEqual(1, autopack_calls)
 === modified file 'bzrlib/tests/test_inventory_delta.py'
 --- bzrlib/tests/test_inventory_delta.py	2009-04-02 05:53:12 +0000
 +++ bzrlib/tests/test_inventory_delta.py	2009-08-14 05:35:32 +0000
@@ -26,6 +26,7 @@
      inventory,
      inventory_delta,
+     )
++from bzrlib.inventory_delta import InventoryDeltaError
  from bzrlib.inventory import Inventory
  from bzrlib.revision import NULL_REVISION
  from bzrlib.tests import TestCase
@@ -68,7 +69,7 @@
  version: entry-version
  versioned_root: false
  tree_references: false
--None\x00/\x00TREE_ROOT\x00\x00null:\x00dir
++None\x00/\x00TREE_ROOT\x00\x00entry-version\x00dir
  """
  reference_lines = """format: bzr inventory delta v1 (bzr 1.14)
@@ -93,39 +94,34 @@
      """Test InventoryDeltaSerializer.parse_text_bytes."""
      def test_parse_no_bytes(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          err = self.assertRaises(
--            errors.BzrError, serializer.parse_text_bytes, '')
--        self.assertContainsRe(str(err), 'unknown format')
++            InventoryDeltaError, deserializer.parse_text_bytes, '')
++        self.assertContainsRe(str(err), 'last line not empty')
      def test_parse_bad_format(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, 'format: foo\n')
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, 'format: foo\n')
          self.assertContainsRe(str(err), 'unknown format')
      def test_parse_no_parent(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes,
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes,
              'format: bzr inventory delta v1 (bzr 1.14)\n')
          self.assertContainsRe(str(err), 'missing parent: marker')
      def test_parse_no_version(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes,
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes,
              'format: bzr inventory delta v1 (bzr 1.14)\n'
              'parent: null:\n')
          self.assertContainsRe(str(err), 'missing version: marker')
      def test_parse_duplicate_key_errors(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          double_root_lines = \
  """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
@@ -135,24 +131,23 @@
  None\x00/\x00an-id\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00
  None\x00/\x00an-id\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, double_root_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, double_root_lines)
          self.assertContainsRe(str(err), 'duplicate file id')
      def test_parse_versioned_root_only(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=True)
--        parse_result = serializer.parse_text_bytes(root_only_lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(root_only_lines)
          expected_entry = inventory.make_entry(
              'directory', u'', None, 'an-id')
          expected_entry.revision = 'a@e\xc3\xa5ample.com--2004'
          self.assertEqual(
--            ('null:', 'entry-version', [(None, '/', 'an-id', expected_entry)]),
++            ('null:', 'entry-version', True, True,
++             [(None, '', 'an-id', expected_entry)]),
              parse_result)
      def test_parse_special_revid_not_valid_last_mod(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=False, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          root_only_lines = """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
  version: null:
@@ -160,13 +155,12 @@
  tree_references: true
  None\x00/\x00TREE_ROOT\x00\x00null:\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, root_only_lines)
          self.assertContainsRe(str(err), 'special revisionid found')
      def test_parse_versioned_root_versioned_disabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=False, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          root_only_lines = """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
  version: null:
@@ -174,39 +168,134 @@
  tree_references: true
  None\x00/\x00TREE_ROOT\x00\x00a@e\xc3\xa5ample.com--2004\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, root_only_lines)
          self.assertContainsRe(str(err), 'Versioned root found')
      def test_parse_unique_root_id_root_versioned_disabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=False, tree_references=True)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
          root_only_lines = """format: bzr inventory delta v1 (bzr 1.14)
--parent: null:
--version: null:
++parent: parent-id
++version: a@e\xc3\xa5ample.com--2004
  versioned_root: false
  tree_references: true
--None\x00/\x00an-id\x00\x00null:\x00dir\x00\x00
++None\x00/\x00an-id\x00\x00parent-id\x00dir\x00\x00
  """
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_lines)
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, root_only_lines)
          self.assertContainsRe(str(err), 'Versioned root found')
      def test_parse_unversioned_root_versioning_enabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, root_only_unversioned)
--        self.assertContainsRe(
--            str(err), 'serialized versioned_root flag is wrong: False')
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(root_only_unversioned)
++        expected_entry = inventory.make_entry(
++            'directory', u'', None, 'TREE_ROOT')
++        expected_entry.revision = 'entry-version'
++        self.assertEqual(
++            ('null:', 'entry-version', False, False,
++             [(None, u'', 'TREE_ROOT', expected_entry)]),
++            parse_result)
++
++    def test_parse_versioned_root_when_disabled(self):
++        deserializer = inventory_delta.InventoryDeltaDeserializer(
++            allow_versioned_root=False)
++        err = self.assertRaises(inventory_delta.IncompatibleInventoryDelta,
++            deserializer.parse_text_bytes, root_only_lines)
++        self.assertEquals("versioned_root not allowed", str(err))
      def test_parse_tree_when_disabled(self):
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=True, tree_references=False)
--        err = self.assertRaises(errors.BzrError,
--            serializer.parse_text_bytes, reference_lines)
--        self.assertContainsRe(
--            str(err), 'serialized tree_references flag is wrong: True')
++        deserializer = inventory_delta.InventoryDeltaDeserializer(
++            allow_tree_references=False)
++        err = self.assertRaises(inventory_delta.IncompatibleInventoryDelta,
++            deserializer.parse_text_bytes, reference_lines)
++        self.assertEquals("Tree reference not allowed", str(err))
++
++    def test_parse_tree_when_header_disallows(self):
++        # A deserializer that allows tree_references to be set or unset.
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        # A serialised inventory delta with a header saying no tree refs, but
++        # that has a tree ref in its content.
++        lines = """format: bzr inventory delta v1 (bzr 1.14)
++parent: null:
++version: entry-version
++versioned_root: false
++tree_references: false
++None\x00/foo\x00id\x00TREE_ROOT\x00changed\x00tree\x00subtree-version
++"""
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
++        self.assertContainsRe(str(err), 'Tree reference found')
++
++    def test_parse_versioned_root_when_header_disallows(self):
++        # A deserializer that allows tree_references to be set or unset.
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        # A serialised inventory delta with a header saying no tree refs, but
++        # that has a tree ref in its content.
++        lines = """format: bzr inventory delta v1 (bzr 1.14)
++parent: null:
++version: entry-version
++versioned_root: false
++tree_references: false
++None\x00/\x00TREE_ROOT\x00\x00a@e\xc3\xa5ample.com--2004\x00dir
++"""
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
++        self.assertContainsRe(str(err), 'Versioned root found')
++
++    def test_parse_last_line_not_empty(self):
++        """newpath must start with / if it is not None."""
++        # Trim the trailing newline from a valid serialization
++        lines = root_only_lines[:-1]
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
++        self.assertContainsRe(str(err), 'last line not empty')
++
++    def test_parse_invalid_newpath(self):
++        """newpath must start with / if it is not None."""
++        lines = empty_lines
++        lines += "None\x00bad\x00TREE_ROOT\x00\x00version\x00dir\n"
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
++        self.assertContainsRe(str(err), 'newpath invalid')
++
++    def test_parse_invalid_oldpath(self):
++        """oldpath must start with / if it is not None."""
++        lines = root_only_lines
++        lines += "bad\x00/new\x00file-id\x00\x00version\x00dir\n"
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        err = self.assertRaises(InventoryDeltaError,
++            deserializer.parse_text_bytes, lines)
++        self.assertContainsRe(str(err), 'oldpath invalid')
++
++    def test_parse_new_file(self):
++        """a new file is parsed correctly"""
++        lines = root_only_lines
++        fake_sha = "deadbeef" * 5
++        lines += (
++            "None\x00/new\x00file-id\x00an-id\x00version\x00file\x00123\x00" +
++            "\x00" + fake_sha + "\n")
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(lines)
++        expected_entry = inventory.make_entry(
++            'file', u'new', 'an-id', 'file-id')
++        expected_entry.revision = 'version'
++        expected_entry.text_size = 123
++        expected_entry.text_sha1 = fake_sha
++        delta = parse_result[4]
++        self.assertEqual(
++             (None, u'new', 'file-id', expected_entry), delta[-1])
++
++    def test_parse_delete(self):
++        lines = root_only_lines
++        lines += (
++            "/old-file\x00None\x00deleted-id\x00\x00null:\x00deleted\x00\x00\n")
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        parse_result = deserializer.parse_text_bytes(lines)
++        delta = parse_result[4]
++        self.assertEqual(
++             (u'old-file', None, 'deleted-id', None), delta[-1])
  class TestSerialization(TestCase):
@@ -237,12 +326,20 @@
          old_inv = Inventory(None)
          new_inv = Inventory(None)
          root = new_inv.make_entry('directory', '', None, 'TREE_ROOT')
++        # Implicit roots are considered modified in every revision.
++        root.revision = 'entry-version'
          new_inv.add(root)
          delta = new_inv._make_delta(old_inv)
          serializer = inventory_delta.InventoryDeltaSerializer(
              versioned_root=False, tree_references=False)
++        serialized_lines = serializer.delta_to_lines(
++            NULL_REVISION, 'entry-version', delta)
          self.assertEqual(StringIO(root_only_unversioned).readlines(),
--            serializer.delta_to_lines(NULL_REVISION, 'entry-version', delta))
++            serialized_lines)
++        deserializer = inventory_delta.InventoryDeltaDeserializer()
++        self.assertEqual(
++            (NULL_REVISION, 'entry-version', False, False, delta),
++            deserializer.parse_text_bytes(''.join(serialized_lines)))
      def test_unversioned_non_root_errors(self):
          old_inv = Inventory(None)
@@ -255,7 +352,7 @@
          delta = new_inv._make_delta(old_inv)
          serializer = inventory_delta.InventoryDeltaSerializer(
              versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
++        err = self.assertRaises(InventoryDeltaError,
              serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
          self.assertEqual(str(err), 'no version for fileid id')
@@ -267,7 +364,7 @@
          delta = new_inv._make_delta(old_inv)
          serializer = inventory_delta.InventoryDeltaSerializer(
              versioned_root=True, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
++        err = self.assertRaises(InventoryDeltaError,
              serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
          self.assertEqual(str(err), 'no version for fileid TREE_ROOT')
@@ -280,22 +377,9 @@
          delta = new_inv._make_delta(old_inv)
          serializer = inventory_delta.InventoryDeltaSerializer(
              versioned_root=False, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
--            serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
--        self.assertEqual(str(err), 'Version present for / in TREE_ROOT')
--
--    def test_nonrichroot_non_TREE_ROOT_id_errors(self):
--        old_inv = Inventory(None)
--        new_inv = Inventory(None)
--        root = new_inv.make_entry('directory', '', None, 'my-rich-root-id')
--        new_inv.add(root)
--        delta = new_inv._make_delta(old_inv)
--        serializer = inventory_delta.InventoryDeltaSerializer(
--            versioned_root=False, tree_references=True)
--        err = self.assertRaises(errors.BzrError,
--            serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
--        self.assertEqual(
--            str(err), 'file_id my-rich-root-id is not TREE_ROOT for /')
++        err = self.assertRaises(InventoryDeltaError,
++            serializer.delta_to_lines, NULL_REVISION, 'entry-version', delta)
++        self.assertStartsWith(str(err), 'Version present for / in TREE_ROOT')
      def test_unknown_kind_errors(self):
          old_inv = Inventory(None)
@@ -354,19 +438,22 @@
              serializer.delta_to_lines(NULL_REVISION, 'entry-version', delta))
      def test_to_inventory_root_id_versioned_not_permitted(self):
--        delta = [(None, '/', 'TREE_ROOT', inventory.make_entry(
--            'directory', '', None, 'TREE_ROOT'))]
--        serializer = inventory_delta.InventoryDeltaSerializer(False, True)
++        root_entry = inventory.make_entry('directory', '', None, 'TREE_ROOT')
++        root_entry.revision = 'some-version'
++        delta = [(None, '', 'TREE_ROOT', root_entry)]
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=False, tree_references=True)
          self.assertRaises(
--            errors.BzrError, serializer.delta_to_lines, 'old-version',
++            InventoryDeltaError, serializer.delta_to_lines, 'old-version',
              'new-version', delta)
      def test_to_inventory_root_id_not_versioned(self):
--        delta = [(None, '/', 'an-id', inventory.make_entry(
++        delta = [(None, '', 'an-id', inventory.make_entry(
              'directory', '', None, 'an-id'))]
--        serializer = inventory_delta.InventoryDeltaSerializer(True, True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          self.assertRaises(
--            errors.BzrError, serializer.delta_to_lines, 'old-version',
++            InventoryDeltaError, serializer.delta_to_lines, 'old-version',
              'new-version', delta)
      def test_to_inventory_has_tree_not_meant_to(self):
@@ -374,13 +461,14 @@
          tree_ref = make_entry('tree-reference', 'foo', 'changed-in', 'ref-id')
          tree_ref.reference_revision = 'ref-revision'
          delta = [
--            (None, '/', 'an-id',
++            (None, '', 'an-id',
               make_entry('directory', '', 'changed-in', 'an-id')),
--            (None, '/foo', 'ref-id', tree_ref)
++            (None, 'foo', 'ref-id', tree_ref)
              # a file that followed the root move
+             ]
--        serializer = inventory_delta.InventoryDeltaSerializer(True, True)
--        self.assertRaises(errors.BzrError, serializer.delta_to_lines,
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
++        self.assertRaises(InventoryDeltaError, serializer.delta_to_lines,
              'old-version', 'new-version', delta)
      def test_to_inventory_torture(self):
@@ -430,7 +518,8 @@
                     executable=True, text_size=30, text_sha1='some-sha',
                     revision='old-rev')),
+             ]
--        serializer = inventory_delta.InventoryDeltaSerializer(True, True)
++        serializer = inventory_delta.InventoryDeltaSerializer(
++            versioned_root=True, tree_references=True)
          lines = serializer.delta_to_lines(NULL_REVISION, 'something', delta)
          expected = """format: bzr inventory delta v1 (bzr 1.14)
  parent: null:
@@ -483,13 +572,13 @@
      def test_file_without_size(self):
          file_entry = inventory.make_entry('file', 'a file', None, 'file-id')
          file_entry.text_sha1 = 'foo'
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._file_content, file_entry)
      def test_file_without_sha1(self):
          file_entry = inventory.make_entry('file', 'a file', None, 'file-id')
          file_entry.text_size = 10
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._file_content, file_entry)
      def test_link_empty_target(self):
@@ -512,7 +601,7 @@
      def test_link_no_target(self):
          entry = inventory.make_entry('symlink', 'a link', None)
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._link_content, entry)
      def test_reference_null(self):
@@ -529,5 +618,5 @@
      def test_reference_no_reference(self):
          entry = inventory.make_entry('tree-reference', 'a tree', None)
--        self.assertRaises(errors.BzrError,
++        self.assertRaises(InventoryDeltaError,
              inventory_delta._reference_content, entry)
 === modified file 'bzrlib/tests/test_remote.py'
 --- bzrlib/tests/test_remote.py	2009-08-11 05:26:57 +0000
 +++ bzrlib/tests/test_remote.py	2009-08-14 05:35:32 +0000
@@ -31,6 +31,8 @@
      config,
      errors,
      graph,
++    inventory,
++    inventory_delta,
      pack,
      remote,
      repository,
@@ -38,6 +40,7 @@
      tests,
      treebuilder,
      urlutils,
++    versionedfile,
+     )
  from bzrlib.branch import Branch
  from bzrlib.bzrdir import BzrDir, BzrDirFormat
@@ -332,15 +335,6 @@
          reference_bzrdir_format = bzrdir.format_registry.get('default')()
          return reference_bzrdir_format.repository_format
--    def disable_verb(self, verb):
--        """Disable a verb for one test."""
--        request_handlers = smart.request.request_handlers
--        orig_method = request_handlers.get(verb)
--        request_handlers.remove(verb)
--        def restoreVerb():
--            request_handlers.register(verb, orig_method)
--        self.addCleanup(restoreVerb)
--
      def assertFinished(self, fake_client):
          """Assert that all of a FakeClient's expected calls have occurred."""
          fake_client.finished_test()
@@ -2219,63 +2213,219 @@
          self.assertEqual([], client._calls)
--class TestRepositoryInsertStream(TestRemoteRepository):
--
--    def test_unlocked_repo(self):
--        transport_path = 'quack'
--        repo, client = self.setup_fake_client_and_repository(transport_path)
--        client.add_expected_call(
--            'Repository.insert_stream', ('quack/', ''),
--            'success', ('ok',))
--        client.add_expected_call(
--            'Repository.insert_stream', ('quack/', ''),
--            'success', ('ok',))
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
--
--    def test_locked_repo_with_no_lock_token(self):
--        transport_path = 'quack'
--        repo, client = self.setup_fake_client_and_repository(transport_path)
--        client.add_expected_call(
--            'Repository.lock_write', ('quack/', ''),
--            'success', ('ok', ''))
--        client.add_expected_call(
--            'Repository.insert_stream', ('quack/', ''),
--            'success', ('ok',))
--        client.add_expected_call(
--            'Repository.insert_stream', ('quack/', ''),
--            'success', ('ok',))
--        repo.lock_write()
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
--
--    def test_locked_repo_with_lock_token(self):
--        transport_path = 'quack'
--        repo, client = self.setup_fake_client_and_repository(transport_path)
--        client.add_expected_call(
--            'Repository.lock_write', ('quack/', ''),
--            'success', ('ok', 'a token'))
--        client.add_expected_call(
--            'Repository.insert_stream_locked', ('quack/', '', 'a token'),
--            'success', ('ok',))
--        client.add_expected_call(
--            'Repository.insert_stream_locked', ('quack/', '', 'a token'),
--            'success', ('ok',))
--        repo.lock_write()
--        sink = repo._get_sink()
--        fmt = repository.RepositoryFormat.get_default_format()
--        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
--        self.assertEqual([], resume_tokens)
--        self.assertEqual(set(), missing_keys)
--        self.assertFinished(client)
++class TestRepositoryInsertStreamBase(TestRemoteRepository):
++    """Base class for Repository.insert_stream and .insert_stream_1.19
++    tests.
++    """
++
++    def checkInsertEmptyStream(self, repo, client):
++        """Insert an empty stream, checking the result.
++
++        This checks that there are no resume_tokens or missing_keys, and that
++        the client is finished.
++        """
++        sink = repo._get_sink()
++        fmt = repository.RepositoryFormat.get_default_format()
++        resume_tokens, missing_keys = sink.insert_stream([], fmt, [])
++        self.assertEqual([], resume_tokens)
++        self.assertEqual(set(), missing_keys)
++        self.assertFinished(client)
++
++
++class TestRepositoryInsertStream(TestRepositoryInsertStreamBase):
++    """Tests for using Repository.insert_stream verb when the _1.19 variant is
++    not available.
++
++    This test case is very similar to TestRepositoryInsertStream_1_19.
++    """
++
++    def setUp(self):
++        TestRemoteRepository.setUp(self)
++        self.disable_verb('Repository.insert_stream_1.19')
++
++    def test_unlocked_repo(self):
++        transport_path = 'quack'
++        repo, client = self.setup_fake_client_and_repository(transport_path)
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', ''),
++            'unknown', ('Repository.insert_stream_1.19',))
++        client.add_expected_call(
++            'Repository.insert_stream', ('quack/', ''),
++            'success', ('ok',))
++        client.add_expected_call(
++            'Repository.insert_stream', ('quack/', ''),
++            'success', ('ok',))
++        self.checkInsertEmptyStream(repo, client)
++
++    def test_locked_repo_with_no_lock_token(self):
++        transport_path = 'quack'
++        repo, client = self.setup_fake_client_and_repository(transport_path)
++        client.add_expected_call(
++            'Repository.lock_write', ('quack/', ''),
++            'success', ('ok', ''))
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', ''),
++            'unknown', ('Repository.insert_stream_1.19',))
++        client.add_expected_call(
++            'Repository.insert_stream', ('quack/', ''),
++            'success', ('ok',))
++        client.add_expected_call(
++            'Repository.insert_stream', ('quack/', ''),
++            'success', ('ok',))
++        repo.lock_write()
++        self.checkInsertEmptyStream(repo, client)
++
++    def test_locked_repo_with_lock_token(self):
++        transport_path = 'quack'
++        repo, client = self.setup_fake_client_and_repository(transport_path)
++        client.add_expected_call(
++            'Repository.lock_write', ('quack/', ''),
++            'success', ('ok', 'a token'))
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', '', 'a token'),
++            'unknown', ('Repository.insert_stream_1.19',))
++        client.add_expected_call(
++            'Repository.insert_stream_locked', ('quack/', '', 'a token'),
++            'success', ('ok',))
++        client.add_expected_call(
++            'Repository.insert_stream_locked', ('quack/', '', 'a token'),
++            'success', ('ok',))
++        repo.lock_write()
++        self.checkInsertEmptyStream(repo, client)
++
++    def test_stream_with_inventory_deltas(self):
++        """'inventory-deltas' substreams cannot be sent to the
++        Repository.insert_stream verb, because not all servers that implement
++        that verb will accept them.  So when one is encountered the RemoteSink
++        immediately stops using that verb and falls back to VFS insert_stream.
++        """
++        transport_path = 'quack'
++        repo, client = self.setup_fake_client_and_repository(transport_path)
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', ''),
++            'unknown', ('Repository.insert_stream_1.19',))
++        client.add_expected_call(
++            'Repository.insert_stream', ('quack/', ''),
++            'success', ('ok',))
++        client.add_expected_call(
++            'Repository.insert_stream', ('quack/', ''),
++            'success', ('ok',))
++        # Create a fake real repository for insert_stream to fall back on, so
++        # that we can directly see the records the RemoteSink passes to the
++        # real sink.
++        class FakeRealSink:
++            def __init__(self):
++                self.records = []
++            def insert_stream(self, stream, src_format, resume_tokens):
++                for substream_kind, substream in stream:
++                    self.records.append(
++                        (substream_kind, [record.key for record in substream]))
++                return ['fake tokens'], ['fake missing keys']
++        fake_real_sink = FakeRealSink()
++        class FakeRealRepository:
++            def _get_sink(self):
++                return fake_real_sink
++        repo._real_repository = FakeRealRepository()
++        sink = repo._get_sink()
++        fmt = repository.RepositoryFormat.get_default_format()
++        stream = self.make_stream_with_inv_deltas(fmt)
++        resume_tokens, missing_keys = sink.insert_stream(stream, fmt, [])
++        # Every record from the first inventory delta should have been sent to
++        # the VFS sink.
++        expected_records = [
++            ('inventory-deltas', [('rev2',), ('rev3',)]),
++            ('texts', [('some-rev', 'some-file')])]
++        self.assertEqual(expected_records, fake_real_sink.records)
++        # The return values from the real sink's insert_stream are propagated
++        # back to the original caller.
++        self.assertEqual(['fake tokens'], resume_tokens)
++        self.assertEqual(['fake missing keys'], missing_keys)
++        self.assertFinished(client)
++
++    def make_stream_with_inv_deltas(self, fmt):
++        """Make a simple stream with an inventory delta followed by more
++        records and more substreams to test that all records and substreams
++        from that point on are used.
++
++        This sends, in order:
++           * inventories substream: rev1, rev2, rev3.  rev2 and rev3 are
++             inventory-deltas.
++           * texts substream: (some-rev, some-file)
++        """
++        # Define a stream using generators so that it isn't rewindable.
++        inv = inventory.Inventory(revision_id='rev1')
++        def stream_with_inv_delta():
++            yield ('inventories', inventories_substream())
++            yield ('inventory-deltas', inventory_delta_substream())
++            yield ('texts', [
++                versionedfile.FulltextContentFactory(
++                    ('some-rev', 'some-file'), (), None, 'content')])
++        def inventories_substream():
++            # An empty inventory fulltext.  This will be streamed normally.
++            text = fmt._serializer.write_inventory_to_string(inv)
++            yield versionedfile.FulltextContentFactory(
++                ('rev1',), (), None, text)
++        def inventory_delta_substream():
++            # An inventory delta.  This can't be streamed via this verb, so it
++            # will trigger a fallback to VFS insert_stream.
++            entry = inv.make_entry(
++                'directory', 'newdir', inv.root.file_id, 'newdir-id')
++            entry.revision = 'ghost'
++            delta = [(None, 'newdir', 'newdir-id', entry)]
++            serializer = inventory_delta.InventoryDeltaSerializer(
++                versioned_root=True, tree_references=False)
++            lines = serializer.delta_to_lines('rev1', 'rev2', delta)
++            yield versionedfile.ChunkedContentFactory(
++                ('rev2',), (('rev1',)), None, lines)
++            # Another delta.
++            lines = serializer.delta_to_lines('rev1', 'rev3', delta)
++            yield versionedfile.ChunkedContentFactory(
++                ('rev3',), (('rev1',)), None, lines)
++        return stream_with_inv_delta()
++
++
++class TestRepositoryInsertStream_1_19(TestRepositoryInsertStreamBase):
++
++    def test_unlocked_repo(self):
++        transport_path = 'quack'
++        repo, client = self.setup_fake_client_and_repository(transport_path)
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', ''),
++            'success', ('ok',))
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', ''),
++            'success', ('ok',))
++        self.checkInsertEmptyStream(repo, client)
++
++    def test_locked_repo_with_no_lock_token(self):
++        transport_path = 'quack'
++        repo, client = self.setup_fake_client_and_repository(transport_path)
++        client.add_expected_call(
++            'Repository.lock_write', ('quack/', ''),
++            'success', ('ok', ''))
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', ''),
++            'success', ('ok',))
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', ''),
++            'success', ('ok',))
++        repo.lock_write()
++        self.checkInsertEmptyStream(repo, client)
++
++    def test_locked_repo_with_lock_token(self):
++        transport_path = 'quack'
++        repo, client = self.setup_fake_client_and_repository(transport_path)
++        client.add_expected_call(
++            'Repository.lock_write', ('quack/', ''),
++            'success', ('ok', 'a token'))
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', '', 'a token'),
++            'success', ('ok',))
++        client.add_expected_call(
++            'Repository.insert_stream_1.19', ('quack/', '', 'a token'),
++            'success', ('ok',))
++        repo.lock_write()
++        self.checkInsertEmptyStream(repo, client)
  class TestRepositoryTarball(TestRemoteRepository):
 === modified file 'bzrlib/tests/test_selftest.py'
 --- bzrlib/tests/test_selftest.py	2009-08-04 02:09:19 +0000
 +++ bzrlib/tests/test_selftest.py	2009-08-14 05:35:32 +0000
@@ -124,7 +124,7 @@
          self.assertEqual(sample_permutation,
                           get_transport_test_permutations(MockModule()))
--    def test_scenarios_invlude_all_modules(self):
++    def test_scenarios_include_all_modules(self):
          # this checks that the scenario generator returns as many permutations
          # as there are in all the registered transport modules - we assume if
          # this matches its probably doing the right thing especially in
@@ -293,18 +293,16 @@
          from bzrlib.tests.per_interrepository import make_scenarios
          server1 = "a"
          server2 = "b"
--        formats = [(str, "C1", "C2"), (int, "D1", "D2")]
++        formats = [("C0", "C1", "C2"), ("D0", "D1", "D2")]
          scenarios = make_scenarios(server1, server2, formats)
          self.assertEqual([
--            ('str,str,str',
--             {'interrepo_class': str,
--              'repository_format': 'C1',
++            ('C0,str,str',
++             {'repository_format': 'C1',
                'repository_format_to': 'C2',
                'transport_readonly_server': 'b',
                'transport_server': 'a'}),
--            ('int,str,str',
--             {'interrepo_class': int,
--              'repository_format': 'D1',
++            ('D0,str,str',
++             {'repository_format': 'D1',
                'repository_format_to': 'D2',
                'transport_readonly_server': 'b',
                'transport_server': 'a'})],
 === modified file 'bzrlib/tests/test_smart.py'
 --- bzrlib/tests/test_smart.py	2009-07-23 07:37:05 +0000
 +++ bzrlib/tests/test_smart.py	2009-08-14 05:35:32 +0000
@@ -1242,6 +1242,7 @@
              SmartServerResponse(('history-incomplete', 2, r2)),
              request.execute('stacked', 1, (3, r3)))
++
  class TestSmartServerRepositoryGetStream(tests.TestCaseWithMemoryTransport):
      def make_two_commit_repo(self):
 === modified file 'bzrlib/tests/test_xml.py'
 --- bzrlib/tests/test_xml.py	2009-04-03 21:50:40 +0000
 +++ bzrlib/tests/test_xml.py	2009-08-14 05:35:32 +0000
@@ -19,6 +19,7 @@
  from bzrlib import (
      errors,
      inventory,
++    xml6,
      xml7,
      xml8,
      serializer,
@@ -139,6 +140,14 @@
  </inventory>
  """
++_expected_inv_v6 = """<inventory format="6" revision_id="rev_outer">
++<directory file_id="tree-root-321" name="" revision="rev_outer" />
++<directory file_id="dir-id" name="dir" parent_id="tree-root-321" revision="rev_outer" />
++<file file_id="file-id" name="file" parent_id="tree-root-321" revision="rev_outer" text_sha1="A" text_size="1" />
++<symlink file_id="link-id" name="link" parent_id="tree-root-321" revision="rev_outer" symlink_target="a" />
++</inventory>
++"""
++
  _expected_inv_v7 = """<inventory format="7" revision_id="rev_outer">
  <directory file_id="tree-root-321" name="" revision="rev_outer" />
  <directory file_id="dir-id" name="dir" parent_id="tree-root-321" revision="rev_outer" />
@@ -377,6 +386,17 @@
          for path, ie in inv.iter_entries():
              self.assertEqual(ie, inv2[ie.file_id])
++    def test_roundtrip_inventory_v6(self):
++        inv = self.get_sample_inventory()
++        txt = xml6.serializer_v6.write_inventory_to_string(inv)
++        lines = xml6.serializer_v6.write_inventory_to_lines(inv)
++        self.assertEqual(bzrlib.osutils.split_lines(txt), lines)
++        self.assertEqualDiff(_expected_inv_v6, txt)
++        inv2 = xml6.serializer_v6.read_inventory_from_string(txt)
++        self.assertEqual(4, len(inv2))
++        for path, ie in inv.iter_entries():
++            self.assertEqual(ie, inv2[ie.file_id])
++
      def test_wrong_format_v7(self):
          """Can't accidentally open a file with wrong serializer"""
          s_v6 = bzrlib.xml6.serializer_v6
 === modified file 'bzrlib/versionedfile.py'
 --- bzrlib/versionedfile.py	2009-08-04 04:36:34 +0000
 +++ bzrlib/versionedfile.py	2009-08-14 05:35:32 +0000
@@ -1571,13 +1571,14 @@
              record.get_bytes_as(record.storage_kind) call.
          """
          self._bytes_iterator = bytes_iterator
--        self._kind_factory = {'knit-ft-gz':knit.knit_network_to_record,
--            'knit-delta-gz':knit.knit_network_to_record,
--            'knit-annotated-ft-gz':knit.knit_network_to_record,
--            'knit-annotated-delta-gz':knit.knit_network_to_record,
--            'knit-delta-closure':knit.knit_delta_closure_to_records,
--            'fulltext':fulltext_network_to_record,
--            'groupcompress-block':groupcompress.network_block_to_records,
++        self._kind_factory = {
++            'fulltext': fulltext_network_to_record,
++            'groupcompress-block': groupcompress.network_block_to_records,
++            'knit-ft-gz': knit.knit_network_to_record,
++            'knit-delta-gz': knit.knit_network_to_record,
++            'knit-annotated-ft-gz': knit.knit_network_to_record,
++            'knit-annotated-delta-gz': knit.knit_network_to_record,
++            'knit-delta-closure': knit.knit_delta_closure_to_records,
+             }
      def read(self):
 === modified file 'bzrlib/xml5.py'
 --- bzrlib/xml5.py	2009-04-04 02:50:01 +0000
 +++ bzrlib/xml5.py	2009-08-14 05:35:32 +0000
@@ -39,8 +39,8 @@
          format = elt.get('format')
          if format is not None:
              if format != '5':
--                raise BzrError("invalid format version %r on inventory"
--                                % format)
++                raise errors.BzrError("invalid format version %r on inventory"
++                                      % format)
          data_revision_id = elt.get('revision_id')
          if data_revision_id is not None:
              revision_id = cache_utf8.encode(data_revision_id)