Code review comment for lp:~spiv/bzr/inventory-delta

John A Meinel (jameinel) wrote:

Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/inventory-delta into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This is a pretty big patch. It does lots of things:
>
> * adds new insert_stream and get_stream verbs
> * adds de/serialization of inventory-delta records on the network
> * fixes rich-root generation in StreamSource
> * adds a bunch of new scenarios to per_interrepository tests
> * fixes some 'pack already exist' bugs for packing a single GC pack (i.e. when
> the new pack is already optimal).
> * improves the inventory_delta module a little
> * various miscellaneous fixes and new tests that are hopefully self-evident
> * and, most controversially, removes InterDifferingSerializer.
>
> From John's mail a while back there were a bunch of issues with removing IDS. I
> think the outstanding ones are:
>
>> 1) Incremental updates. IDS converts batches of 100 revs at a time,
>> which also triggers autopacks at 1k revs. Streaming fetch is currently
>> all-or-nothing, which isn't appropriate (IMO) for conversions.
>> Considering that conversion can take *days*, it is important to have
>> something that can be stopped and resumed.

It also picks out the 'optimal' delta by computing several different
ones and taking whichever is smallest. For local conversions, the time
to compute 2-3 candidate deltas was much smaller than the time to apply
an inefficient delta.
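
Roughly, the shape IDS had in mind was the following (make_delta,
delta_size and add_inventory_by_delta are just placeholders here, not
the real bzrlib API):

    BATCH_SIZE = 100  # IDS converted ~100 revs per write group

    def convert_in_batches(source_repo, target_repo, revision_ids):
        for start in range(0, len(revision_ids), BATCH_SIZE):
            batch = revision_ids[start:start + BATCH_SIZE]
            target_repo.start_write_group()
            try:
                for revid in batch:
                    new_inv = source_repo.get_inventory(revid)
                    parents = source_repo.get_parent_map([revid]).get(revid, ())
                    # Compute a delta against each parent basis and keep
                    # whichever one is smallest; computing 2-3 candidates
                    # is cheap next to applying a bad delta.
                    candidates = [make_delta(source_repo.get_inventory(p), new_inv)
                                  for p in parents]
                    if not candidates:
                        candidates = [make_delta(None, new_inv)]
                    add_inventory_by_delta(target_repo, revid,
                                           min(candidates, key=delta_size))
            except:
                target_repo.abort_write_group()
                raise
            target_repo.commit_write_group()
            # Each committed batch is a point where autopack can fire and
            # where an interrupted conversion could later resume from.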

>>
>> 2) Also, auto-packing as you go avoids the case you ran into, where bzr
>> bloats to 2.4GB before packing back down to 25MB. We know the new format
>> is even more sensitive to packing efficiency. Not to mention that,
>> because a single big stream generates a single large pack, it isn't
>> directly obvious that we are being so inefficient.
>
> i.e. performance concerns.
>

Generally, yes.

There is also:

3) Being able to resume because you snapshotted periodically as you
went. This seems even more important for a network transfer.
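
The resume side is then mostly a matter of asking the target what it
already has before starting, something like (hand-waving over the exact
API again):

    def resumable_convert(source_repo, target_repo, wanted_revision_ids):
        # Revisions committed to the target by earlier runs act as the
        # snapshot; assume wanted_revision_ids is in topological order.
        present = target_repo.has_revisions(wanted_revision_ids)
        todo = [r for r in wanted_revision_ids if r not in present]
        convert_in_batches(source_repo, target_repo, todo)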

> The streaming code now does the conversion in much the same way IDS did,
> but probably still differently enough that we will want to measure the
> impact of this. I'm definitely concerned about case 2, the lack of packing
> as you go, although perhaps the degree of bloat is reduced by using
> semantic inventory-delta records?
>

I don't think bzr bloating from 100MB => 2.4GB (and then back down to
25MB post pack) was because of inventory records. However, if it was
purely because of a bad streaming order, we could probably fix that by
changing how we stream texts.
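
For example, if the problem is just that text records for the same file
arrive scattered through the stream, grouping the keys before streaming
would give the target's compressor related texts to delta against
(purely illustrative; the real ordering lives in the groupcompress /
StreamSource code):

    def grouped_text_keys(text_keys):
        # text_keys are (file_id, revision_id) tuples; keeping all
        # versions of a file adjacent helps compression.  Within a file
        # you would still want a delta-friendly (e.g. topological)
        # order rather than this plain sort.
        return sorted(text_keys)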

> The reason why I eventually deleted IDS was that it was just too burdensome to
> keep two code paths alive, thoroughly tested, and correct. For instance, if we
> simply reinstated IDS for local-only fetches, then most of the test suite,
> including the relevant interrepo tests, would only exercise IDS. Also, IDS
> turned out to have a bug when used on a stacked repository, which the extended
> test suite in this branch revealed (I've forgotten the details, but can dig them
> up if you like). It didn't seem worth the hassle of fixing IDS when I already
> had a working implementation.
>
> I'm certainly open to reinstating IDS if it's the most expedient way to have
> reasonable local performance for upgrades, but I thought I'd try to be bold and
> see if we could just live without the extra complexity. Maybe we can improve
> performance of streaming rather than resurrect IDS?
>
> -Andrew.

I'm certainly open to the suggestion of getting rid of IDS. I don't like
having multiple code paths. It just happens that there are *big* wins to
be had, and it is often easier to write optimized code in a separate
framework.

(3) is an issue I'd like to see addressed, but one that Robert seems
particularly unhappy about us attempting. (See the other bug comments,
etc., about how other systems don't do it and why he feels it isn't
worth doing.)

It was fairly straightforward to do with IDS; the argument from Robert,
I think, is that the client would need to compute whether it has a
'complete' set and thus can commit the current write group. (The
*source* knows that sort of thing and can just say "and now you have
it", but the client has to redo all that work to figure it out from a
stream.)
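
So the target ends up needing bookkeeping roughly like this before it
can checkpoint (the two missing_* checks are stand-ins for whatever the
real completeness tests turn out to be):

    def maybe_checkpoint(target_repo):
        # Only make the write group permanent once nothing inserted so
        # far still refers to data that hasn't arrived yet.
        if missing_compression_parents(target_repo):
            return False   # some texts still lack their delta bases
        if missing_parent_inventories(target_repo):
            return False   # some inventory deltas lack their basis
        target_repo.commit_write_group()
        target_repo.start_write_group()  # carry on with the rest of the stream
        return True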

I'll send a follow-up email later with a more detailed review of the
code changes.

John
=:->