Merge into trunk-old : skip-dupes : Code : Bazaar

Status:	Merged
Merged at revision:	not available
Proposed branch:	lp:~abentley/bzr/skip-dupes
Merge into:	lp:~bzr/bzr/trunk-old
Diff against target:	350 lines
To merge this branch:	bzr merge lp:~abentley/bzr/skip-dupes
Related bugs:	Link a bug report

Reviewer	Review Type	Date Requested	Status
Robert Collins (community)			Approve on 2009-06-27
Review via email: mp+7957@code.launchpad.net

Revision history for this message

Aaron Bentley (abentley) wrote on 2009-06-24: Posted in a previous version of this proposal

#

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

There are several issues related to Launchpad's current problems with
pulling data from incompletely-reconciled repositories. For one thing,
we're incorrectly predicting which records need to be inserted. For
another, we're inserting records that we already have, which wastes disk
space, and causes us to raise exceptions if the record metadata differs.

This patch allows bzr to avoid installing any records which are already
present, which seems like the most robust fix. It would also be nice to
investigate why we're sending too many records.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpCPBIACgkQ0F+nu1YWqI0x+wCdEmvIRyVfkIlvplwDMa5wIxDg
QlQAn2N40bkCb0CNM7jxUneiUmZmDRiz
=EDUm
-----END PGP SIGNATURE-----

Revision history for this message

Robert Collins (lifeless) wrote on 2009-06-24: Posted in a previous version of this proposal

#

On Wed, 2009-06-24 at 14:50 +0000, Aaron Bentley wrote:
> Aaron Bentley has proposed merging lp:~abentley/bzr/skip-dupes into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi all,
>
> There are several issues related to Launchpad's current problems with
> pulling data from incompletely-reconciled repositories. For one thing,
> we're incorrectly predicting which records need to be inserted. For
> another, we're inserting records that we already have, which wastes disk
> space, and causes us to raise exceptions if the record metadata differs.
>
> This patch allows bzr to avoid installing any records which are already
> present, which seems like the most robust fix. It would also be nice to
> investigate why we're sending too many records.

review -1
There are two consequences of the line you've added:
- we'll lookup everything we are fetching in the target repository
- incorrect data in one location will be sticky there and no warning
will be given

From a performance point of view, looking up things we don't have is
pretty much worst case: we'll have to hit every index, and read to the
bottom page, most of the time, to show we don't have it. We got
significant performance wins by removing or batching similar calls in
knit.py over the last couple of years. We could address the incorrect
data being sticky and not warning by doing a warning rather than just
continuing, however it won't fix the performance impact.

Now, add_records in the same file, *does* do a lookup for the same data,
but it does it in a batch which means that we don't run the risk of
cache thrashing on large pulls, *and* its controllable via the random_id
flag: we'll hit every backing index once and only once, and we don't pay
that overhead at all during commit.

I'd also _really_ prefer for us to just fix the data rather than working
around it in push and pull. The more we can work on the basis of the
data being correct, the leaner and faster our code can be.

https://bugs.edge.launchpad.net/bzr/+bug/390563 is dealing with too much
data being sent in push/pull operations and may well have significant
bearing with what you're observing here.

-Rob

On Wed, 2009-06-24 at 14:50 +0000, Aaron Bentley wrote:
> Aaron Bentley has proposed merging lp:~abentley/bzr/skip-dupes into lp:bzr.
> 
> Requested reviews:
>     bzr-core (bzr-core)
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi all,
> 
> There are several issues related to Launchpad's current problems with
> pulling data from incompletely-reconciled repositories.  For one thing,
> we're incorrectly predicting which records need to be inserted.  For
> another, we're inserting records that we already have, which wastes disk
> space, and causes us to raise exceptions if the record metadata differs.
> 
> This patch allows bzr to avoid installing any records which are already
> present, which seems like the most robust fix.  It would also be nice to
> investigate why we're sending too many records.

review -1
There are two consequences of the line you've added:
 - we'll lookup everything we are fetching in the target repository
 - incorrect data in one location will be sticky there and no warning
will be given

From a performance point of view, looking up things we don't have is
pretty much worst case: we'll have to hit every index, and read to the
bottom page, most of the time, to show we don't have it. We got
significant performance wins by removing or batching similar calls in
knit.py over the last couple of years. We could address the incorrect
data being sticky and not warning by doing a warning rather than just
continuing, however it won't fix the performance impact.

Now, add_records in the same file, *does* do a lookup for the same data,
but it does it in a batch which means that we don't run the risk of
cache thrashing on large pulls, *and* its controllable via the random_id
flag: we'll hit every backing index once and only once, and we don't pay
that overhead at all during commit.

I'd also _really_ prefer for us to just fix the data rather than working
around it in push and pull. The more we can work on the basis of the
data being correct, the leaner and faster our code can be.

https://bugs.edge.launchpad.net/bzr/+bug/390563 is dealing with too much
data being sent in push/pull operations and may well have significant
bearing with what you're observing here.

-Rob

review: Disapprove

Revision history for this message

Aaron Bentley (abentley) wrote on 2009-06-25: Posted in a previous version of this proposal

#

Download full text (3.8 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> review -1
> There are two consequences of the line you've added:
> - we'll lookup everything we are fetching in the target repository

We were already looking up everything we were fetching in the target
repository.

> - incorrect data in one location will be sticky there and no warning
> will be given

We were already ignoring irrelevant incorrect data most of the time.
With this change, we would ignore it all the time.

>>From a performance point of view, looking up things we don't have is
> pretty much worst case: we'll have to hit every index, and read to the
> bottom page, most of the time, to show we don't have it. We got
> significant performance wins by removing or batching similar calls in
> knit.py over the last couple of years.

When fetching the last 100 revisions of bzr.dev into a treeless branch,
my patched version was actually faster, in our standard best-of-3 test:

skip-dupes:
real 0m4.431s
user 0m3.808s
sys 0m0.196s

bzr.dev:
real 0m4.439s
user 0m3.788s
sys 0m0.208s

I conclude that pull times are noisy enough that the inefficiency in my
version is not observable. Performance is not a concern.

> Now, add_records in the same file, *does* do a lookup for the same data,
> but it does it in a batch which means that we don't run the risk of
> cache thrashing on large pulls, *and* its controllable via the random_id
> flag: we'll hit every backing index once and only once, and we don't pay
> that overhead at all during commit.

I'd love to do a batched lookup in
GroupCompressVersionedFiles._insert_record_stream, but the API makes
that very difficult. It would be nice if the API told us, possibly in
batches, what records it was about to send.

I'm happy to disable the dupe-skipping code when random_id is True.

> I'd also _really_ prefer for us to just fix the data rather than working
> around it in push and pull.

I'm not really sure what you mean. In my case, my repository is
reconciled, and I'm trying to pull correct data from
lp:~launchpad-pqm/launchpad/devel. This bug prevented that, and no
random launchpad developer can reasonably be expected to reconcile the
master copy of lp:~launchpad-pqm/launchpad/devel.

Because push and pull are acting as a barrier, preventing pure and
impure repositories from interoperating, we have to delay reconciling
lp:~launchpad-pqm/launchpad/devel. If we were to reconcile it now, that
could break all other launchpad devs until they, too, reconciled.

Because push and pull act as a barrier and reconcile takes 3+ days to
run, we would have to stop all PQM commits for 3+ days in order to
reconcile lp:~launchpad-pqm/launchpad/devel. If we were to take a
mirror of lp:~launchpad-pqm/launchpad/devel and reconcile that, we might
not be able to merge in the commits that had happened while reconcile
was running, even though those commit would be correct data.

> The more we can work on the basis of the
> data being correct, the leaner and faster our code can be.

Yes, but this was not any kind of carefully-designed, systematic test.
The only reason it happens is because we're not correctly determining
which records to send. A...

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
>  review -1
> There are two consequences of the line you've added:
>  - we'll lookup everything we are fetching in the target repository

We were already looking up everything we were fetching in the target
repository.

>  - incorrect data in one location will be sticky there and no warning
> will be given

We were already ignoring irrelevant incorrect data most of the time.
With this change, we would ignore it all the time.

>>From a performance point of view, looking up things we don't have is
> pretty much worst case: we'll have to hit every index, and read to the
> bottom page, most of the time, to show we don't have it. We got
> significant performance wins by removing or batching similar calls in
> knit.py over the last couple of years.

When fetching the last 100 revisions of bzr.dev into a treeless branch,
my patched version was actually faster, in our standard best-of-3 test:

skip-dupes:
real	0m4.431s
user	0m3.808s
sys	0m0.196s

bzr.dev:
real	0m4.439s
user	0m3.788s
sys	0m0.208s

I conclude that pull times are noisy enough that the inefficiency in my
version is not observable.  Performance is not a concern.

> Now, add_records in the same file, *does* do a lookup for the same data,
> but it does it in a batch which means that we don't run the risk of
> cache thrashing on large pulls, *and* its controllable via the random_id
> flag: we'll hit every backing index once and only once, and we don't pay
> that overhead at all during commit.

I'd love to do a batched lookup in
GroupCompressVersionedFiles._insert_record_stream,  but the API makes
that very difficult.  It would be nice if the API told us, possibly in
batches, what records it was about to send.

I'm happy to disable the dupe-skipping code when random_id is True.

> I'd also _really_ prefer for us to just fix the data rather than working
> around it in push and pull.

I'm not really sure what you mean.  In my case, my repository is
reconciled, and I'm trying to pull correct data from
lp:~launchpad-pqm/launchpad/devel.  This bug prevented that, and no
random launchpad developer can reasonably be expected to reconcile the
master copy of lp:~launchpad-pqm/launchpad/devel.

Because push and pull are acting as a barrier, preventing pure and
impure repositories from interoperating, we have to delay reconciling
lp:~launchpad-pqm/launchpad/devel.  If we were to reconcile it now, that
could break all other launchpad devs until they, too, reconciled.

Because push and pull act as a barrier and reconcile takes 3+ days to
run, we would have to stop all PQM commits for 3+ days in order to
reconcile lp:~launchpad-pqm/launchpad/devel.  If we were to take a
mirror of lp:~launchpad-pqm/launchpad/devel and reconcile that, we might
not be able to merge in the commits that had happened while reconcile
was running, even though those commit would be correct data.

> The more we can work on the basis of the
> data being correct, the leaner and faster our code can be.

Yes, but this was not any kind of carefully-designed, systematic test.
The only reason it happens is because we're not correctly determining
which records to send.  And it only happens when one side has
reconciled.  If nobody reconciles, no one notices a problem.  So in a
practical sense, it does not help us work on the basis of the data being
correct.

In any case, it appears the faulty revisions were committed with bzr
1.15, so flawed data is a reality with even very recent releases of bzr.
 We cannot depend on data being correct.  We cannot break horribly when
it is not.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpCzI4ACgkQ0F+nu1YWqI3hjACfSlwpMgg8OSXQGPfrX972DiVC
B8MAn1HWcipsC+6N/doIhUTDOINPNkrE
=Cduu
-----END PGP SIGNATURE-----

Revision history for this message

Robert Collins (lifeless) wrote on 2009-06-25: Posted in a previous version of this proposal

#

Download full text (5.4 KiB)

On Thu, 2009-06-25 at 01:03 +0000, Aaron Bentley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Robert Collins wrote:
> > review -1
> > There are two consequences of the line you've added:
> > - we'll lookup everything we are fetching in the target repository
>
> We were already looking up everything we were fetching in the target
> repository.

With a different access pattern, yes. However thats arguably a bug, we
shouldn't need to look up CHK pages at all, as all their ids are
effectively random.

> > - incorrect data in one location will be sticky there and no warning
> > will be given
>
> We were already ignoring irrelevant incorrect data most of the time.
> With this change, we would ignore it all the time.

We were? Could you enlarge on this. We do ignore data that we already
have - is that what you mean?

> >>From a performance point of view, looking up things we don't have is
> > pretty much worst case: we'll have to hit every index, and read to the
> > bottom page, most of the time, to show we don't have it. We got
> > significant performance wins by removing or batching similar calls in
> > knit.py over the last couple of years.
>
> When fetching the last 100 revisions of bzr.dev into a treeless branch,
> my patched version was actually faster, in our standard best-of-3 test:
>
> skip-dupes:
> real 0m4.431s
> user 0m3.808s
> sys 0m0.196s
>
> bzr.dev:
> real 0m4.439s
> user 0m3.788s
> sys 0m0.208s
>
> I conclude that pull times are noisy enough that the inefficiency in my
> version is not observable. Performance is not a concern.

Its not in that test/scale. However, larger pulls on larger trees may
well show different results. bzr itself is small in the scale of
projects that are using bzr now.

> > Now, add_records in the same file, *does* do a lookup for the same data,
> > but it does it in a batch which means that we don't run the risk of
> > cache thrashing on large pulls, *and* its controllable via the random_id
> > flag: we'll hit every backing index once and only once, and we don't pay
> > that overhead at all during commit.
>
> I'd love to do a batched lookup in
> GroupCompressVersionedFiles._insert_record_stream, but the API makes
> that very difficult. It would be nice if the API told us, possibly in
> batches, what records it was about to send.

Another way to structure this is to not error when add_records detects a
duplicate: just don't index it. The content will be redundant in the
group its in, but not looked up from there. (And if the group is empty
we could discard it). That would mean that only a single lookup happens
and would address the performance concerns I have.

> I'm happy to disable the dupe-skipping code when random_id is True.
>
> > I'd also _really_ prefer for us to just fix the data rather than working
> > around it in push and pull.
>
> I'm not really sure what you mean. In my case, my repository is
> reconciled, and I'm trying to pull correct data from
> lp:~launchpad-pqm/launchpad/devel. This bug prevented that, and no
> random launchpad developer can reasonably be expected to reconcile the
> master copy of lp:~launchpad-pqm/launchpad/devel.

The master copy was meant...

On Thu, 2009-06-25 at 01:03 +0000, Aaron Bentley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> >  review -1
> > There are two consequences of the line you've added:
> >  - we'll lookup everything we are fetching in the target repository
> 
> We were already looking up everything we were fetching in the target
> repository.

With a different access pattern, yes. However thats arguably a bug, we
shouldn't need to look up CHK pages at all, as all their ids are
effectively random.

> >  - incorrect data in one location will be sticky there and no warning
> > will be given
> 
> We were already ignoring irrelevant incorrect data most of the time.
> With this change, we would ignore it all the time.

We were? Could you enlarge on this. We do ignore data that we already
have - is that what you mean?

> >>From a performance point of view, looking up things we don't have is
> > pretty much worst case: we'll have to hit every index, and read to the
> > bottom page, most of the time, to show we don't have it. We got
> > significant performance wins by removing or batching similar calls in
> > knit.py over the last couple of years.
> 
> When fetching the last 100 revisions of bzr.dev into a treeless branch,
> my patched version was actually faster, in our standard best-of-3 test:
> 
> skip-dupes:
> real	0m4.431s
> user	0m3.808s
> sys	0m0.196s
> 
> bzr.dev:
> real	0m4.439s
> user	0m3.788s
> sys	0m0.208s
> 
> I conclude that pull times are noisy enough that the inefficiency in my
> version is not observable.  Performance is not a concern.

Its not in that test/scale. However, larger pulls on larger trees may
well show different results. bzr itself is small in the scale of
projects that are using bzr now.

> > Now, add_records in the same file, *does* do a lookup for the same data,
> > but it does it in a batch which means that we don't run the risk of
> > cache thrashing on large pulls, *and* its controllable via the random_id
> > flag: we'll hit every backing index once and only once, and we don't pay
> > that overhead at all during commit.
> 
> I'd love to do a batched lookup in
> GroupCompressVersionedFiles._insert_record_stream,  but the API makes
> that very difficult.  It would be nice if the API told us, possibly in
> batches, what records it was about to send.

Another way to structure this is to not error when add_records detects a
duplicate: just don't index it. The content will be redundant in the
group its in, but not looked up from there. (And if the group is empty
we could discard it). That would mean that only a single lookup happens
and would address the performance concerns I have.

> I'm happy to disable the dupe-skipping code when random_id is True.
> 
> > I'd also _really_ prefer for us to just fix the data rather than working
> > around it in push and pull.
> 
> I'm not really sure what you mean.  In my case, my repository is
> reconciled, and I'm trying to pull correct data from
> lp:~launchpad-pqm/launchpad/devel.  This bug prevented that, and no
> random launchpad developer can reasonably be expected to reconcile the
> master copy of lp:~launchpad-pqm/launchpad/devel.

The master copy was meant to be reconciled *before* migrating; and all
user repositories were also meant to be reconciled.

> Because push and pull are acting as a barrier, preventing pure and
> impure repositories from interoperating, we have to delay reconciling
> lp:~launchpad-pqm/launchpad/devel.  If we were to reconcile it now, that
> could break all other launchpad devs until they, too, reconciled.
> 
> Because push and pull act as a barrier and reconcile takes 3+ days to
> run, we would have to stop all PQM commits for 3+ days in order to
> reconcile lp:~launchpad-pqm/launchpad/devel.  If we were to take a
> mirror of lp:~launchpad-pqm/launchpad/devel and reconcile that, we might
> not be able to merge in the commits that had happened while reconcile
> was running, even though those commit would be correct data.

check and reconcile performance are on the todo list, as high priority
items; right now the bugs with stacked branches are even higher, as they
prevent repositories that are consistent interoperating.

> > The more we can work on the basis of the
> > data being correct, the leaner and faster our code can be.
> 
> Yes, but this was not any kind of carefully-designed, systematic test.
> The only reason it happens is because we're not correctly determining
> which records to send.  And it only happens when one side has
> reconciled.  If nobody reconciles, no one notices a problem.  So in a
> practical sense, it does not help us work on the basis of the data being
> correct.
> 
> In any case, it appears the faulty revisions were committed with bzr
> 1.15, so flawed data is a reality with even very recent releases of bzr.
>  We cannot depend on data being correct.  We cannot break horribly when
> it is not.

I don't recall any changes to file parent graphs calculations being made
from bzr 1.15->1.16. The only change that I recall is the rich-root
parent calculation was altered; but that was done before any of the 2a
supporting bzr's, so it shouldn't be possible for reconcile to give
different answers there.

Another possibility is that the inconsistency check is actually buggy,
or that the revisions that were different were reconciled in a stacked
repository (where we don't correct or truncate data at the edge of the
graph) - have you investigated that at all?

-Rob

Revision history for this message

Aaron Bentley (abentley) wrote on 2009-06-25: Posted in a previous version of this proposal

#

Download full text (5.2 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Thu, 2009-06-25 at 01:03 +0000, Aaron Bentley wrote:
>> We were already looking up everything we were fetching in the target
>> repository.
>
> With a different access pattern, yes. However thats arguably a bug, we
> shouldn't need to look up CHK pages at all, as all their ids are
> effectively random.

So you're saying that we should be looking up all records except CHK
records? Do you mean that their ids vary across repositories? I did
not think that was the case.

>>> - incorrect data in one location will be sticky there and no warning
>>> will be given
>> We were already ignoring irrelevant incorrect data most of the time.
>> With this change, we would ignore it all the time.
>
> We were? Could you enlarge on this. We do ignore data that we already
> have - is that what you mean?

That is what I mean.

>> I conclude that pull times are noisy enough that the inefficiency in my
>> version is not observable. Performance is not a concern.
>
> Its not in that test/scale. However, larger pulls on larger trees may
> well show different results. bzr itself is small in the scale of
> projects that are using bzr now.

I've based my statements about performance on benchmarking. Please base
your statements about performance on benchmarking.

>> I'd love to do a batched lookup in
>> GroupCompressVersionedFiles._insert_record_stream, but the API makes
>> that very difficult. It would be nice if the API told us, possibly in
>> batches, what records it was about to send.
>
> Another way to structure this is to not error when add_records detects a
> duplicate: just don't index it. The content will be redundant in the
> group its in, but not looked up from there.

I thought you wanted repository indices to be fully reconstructible from
the data in the repository itself. Having two entries for a node, with
one deliberately omitted from the index is not a situation we could
reconstruct from the repository alone.

> That would mean that only a single lookup happens
> and would address the performance concerns I have.

I'd be willing to fix it that way.

>> lp:~launchpad-pqm/launchpad/devel. This bug prevented that, and no
>> random launchpad developer can reasonably be expected to reconcile the
>> master copy of lp:~launchpad-pqm/launchpad/devel.
>
> The master copy was meant to be reconciled *before* migrating;

It was. Then bad commits were pushed into it:
<email address hidden>
<email address hidden>
<email address hidden>

> and all
> user repositories were also meant to be reconciled.

Mine was. That's what made its versions of those revisions incompatible
with the versions in launchpad trunk.

>> Because push and pull act as a barrier and reconcile takes 3+ days to
>> run, we would have to stop all PQM commits for 3+ days in order to
>> reconcile lp:~launchpad-pqm/launchpad/devel. If we were to take a
>> mirror of lp:~launchpad-pqm/launchpad/devel and reconcile that, we might
>> not be able to merge in the commits that had happened while reconcile
>> was ru...

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Thu, 2009-06-25 at 01:03 +0000, Aaron Bentley wrote:
>> We were already looking up everything we were fetching in the target
>> repository.
> 
> With a different access pattern, yes. However thats arguably a bug, we
> shouldn't need to look up CHK pages at all, as all their ids are
> effectively random.

So you're saying that we should be looking up all records except CHK
records?  Do you mean that their ids vary across repositories?  I did
not think that was the case.

>>>  - incorrect data in one location will be sticky there and no warning
>>> will be given
>> We were already ignoring irrelevant incorrect data most of the time.
>> With this change, we would ignore it all the time.
> 
> We were? Could you enlarge on this. We do ignore data that we already
> have - is that what you mean?

That is what I mean.

>> I conclude that pull times are noisy enough that the inefficiency in my
>> version is not observable.  Performance is not a concern.
> 
> Its not in that test/scale. However, larger pulls on larger trees may
> well show different results. bzr itself is small in the scale of
> projects that are using bzr now.

I've based my statements about performance on benchmarking.  Please base
your statements about performance on benchmarking.

>> I'd love to do a batched lookup in
>> GroupCompressVersionedFiles._insert_record_stream,  but the API makes
>> that very difficult.  It would be nice if the API told us, possibly in
>> batches, what records it was about to send.
> 
> Another way to structure this is to not error when add_records detects a
> duplicate: just don't index it. The content will be redundant in the
> group its in, but not looked up from there.

I thought you wanted repository indices to be fully reconstructible from
the data in the repository itself.  Having two entries for a node, with
one deliberately omitted from the index is not a situation we could
reconstruct from the repository alone.

> That would mean that only a single lookup happens
> and would address the performance concerns I have.

I'd be willing to fix it that way.

>> lp:~launchpad-pqm/launchpad/devel.  This bug prevented that, and no
>> random launchpad developer can reasonably be expected to reconcile the
>> master copy of lp:~launchpad-pqm/launchpad/devel.
> 
> The master copy was meant to be reconciled *before* migrating;

It was.  Then bad commits were pushed into it:
abel.deuring@canonical.com-20090612163602-lz5zk5qep4qhqwjk
abel.deuring@canonical.com-20090602082049-fiy6iz2mijxpwlge
abel.deuring@canonical.com-20090612090543-dz2kjihx7jnbj0f6

> and all
> user repositories were also meant to be reconciled.

Mine was.  That's what made its versions of those revisions incompatible
with the versions in launchpad trunk.

>> Because push and pull act as a barrier and reconcile takes 3+ days to
>> run, we would have to stop all PQM commits for 3+ days in order to
>> reconcile lp:~launchpad-pqm/launchpad/devel.  If we were to take a
>> mirror of lp:~launchpad-pqm/launchpad/devel and reconcile that, we might
>> not be able to merge in the commits that had happened while reconcile
>> was running, even though those commit would be correct data.
> 
> check and reconcile performance are on the todo list, as high priority
> items; right now the bugs with stacked branches are even higher, as they
> prevent repositories that are consistent interoperating.

I appreciate that there is lots of work.

The situation would improve greatly if reconcile could run in 8 hours or
less.  More than 24 hours is still a problem.

Still, the bug this branch fixes is preventing us from reconciling a
mirror of trunk and pulling in any new revisions from the live
(partially reconciled) copy.

>> In any case, it appears the faulty revisions were committed with bzr
>> 1.15, so flawed data is a reality with even very recent releases of bzr.
>>  We cannot depend on data being correct.  We cannot break horribly when
>> it is not.
> 
> I don't recall any changes to file parent graphs calculations being made
> from bzr 1.15->1.16.

The bad revisions were comitted by abel, and he uses the beta PPA.  At
the time the revisions were committed, he had whatever came before
1.6rc1 in the beta PPA.

> The only change that I recall is the rich-root
> parent calculation was altered; but that was done before any of the 2a
> supporting bzr's, so it shouldn't be possible for reconcile to give
> different answers there.

This is not a case of reconcile giving different answers.

> Another possibility is that the inconsistency check is actually buggy,

Jam helped me manually verify the differing parent data about texts.

> or that the revisions that were different were reconciled in a stacked
> repository

My repository was definitely not stacked.  I strongly suspect thumper's
copy of launchpad trunk was not stacked when he reconciled it, but
that's irrelevant, since it was reconciled before the buggy revisions
were added.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpC5qQACgkQ0F+nu1YWqI1X0gCfQmgVXSiNytHlSPD6RP6nrX9Q
8tIAnjTioy7VlNRcskrgQpcXuhRz9x3G
=29/6
-----END PGP SIGNATURE-----

Revision history for this message

Robert Collins (lifeless) wrote on 2009-06-25: Posted in a previous version of this proposal

#

One further note that is relevant I think.

John, Vincent and I are closing in on a fix for bug 350563. Once this is
fixed, duplicate texts won't be pulled across and I think your headaches
with fetch() will go away. The current status is that there is a branch
(lp:~lifeless/bzr/bug-350563) which has a partial fix - one that should
cover all the cases you're running into in launchpad, but won't cover
some more esoteric cases to do with the internal structure of CHK page
layout.

I'd appreciate it if you could try pulling with this branch and see if
it stops you having errors.

-Rob

Revision history for this message

Aaron Bentley (abentley) wrote on 2009-06-25: Posted in a previous version of this proposal

#

Per Robert's request, this revised version handles duplicates by not adding the records to the index.
It turns out that this was already the behaviour, only it would raise an exception if the duplicate record was inconsistent.

So this patch changes the error to a warning, which is also in line with Robert's wishes. It also adds the key of the inconsistent record to the message.

Revision history for this message

Robert Collins (lifeless) wrote on 2009-06-25: Posted in a previous version of this proposal

#

Thank you for making these changes. I was still nervous about things
going wrong - stepping back from paranoia isn't easy.

If you can do the following, I'll feel comfortable that things won't go
badly wrong *and* get ignored by users.

Add some sort of flag to the state of the GroupCompressVersionedFiles
object indicating 'trace or raise'. Then in the GCPackRepository
__init__, set the flag as follows:
revisions: raise
inventories: trace
texts: trace
signatures: trace
chk_bytes: trace

This will mean that if the revisions graph is wrong we stop cold - which
is important because that affects fetch. The other graphs don't affect
fetch and are consequently a lower priorityIMO.

-Rob

Revision history for this message

Aaron Bentley (abentley) wrote on 2009-06-26:

#

Here is a new version, revised per Robert's request, so that inconsistent unwanted revisions are fatal, while all others generate a warning.

Revision history for this message

Robert Collins (lifeless) wrote on 2009-06-27:

#

review +1

Other tests for '2a' are in the test_repository.py file; you might like
to put your new one in the Test2a class in there for consistency. Other
than that optional change - looks good, please land.

-Rob

review: Approve

Revision history for this message

Robert Collins (lifeless) wrote on 2009-06-27:

#

review +1

Other tests for '2a' are in the test_repository.py file; you might like
to put your new one in the Test2a class in there for consistency. Other
than that optional change - looks good, please land.

-Rob

review: Approve

Revision history for this message

Aaron Bentley (abentley) wrote on 2009-06-29:

#

> review +1
>
> Other tests for '2a' are in the test_repository.py file; you might like
> to put your new one in the Test2a class in there for consistency. Other
> than that optional change - looks good, please land.

Cool, thanks.

I think having a test_groupcompress_repo would be more discoverable, but I'll move the tests to test_repository.

Bazaar

Merge lp:~abentley/bzr/skip-dupes into lp:~bzr/bzr/trunk-old

Commit message

Description of the change

Preview Diff

Subscribers

 === modified file 'NEWS'
 --- NEWS	2009-06-26 23:28:46 +0000
 +++ NEWS	2009-06-29 15:35:12 +0000
@@ -50,6 +50,9 @@
  * ``bzr ls DIR --from-root`` now shows only things in DIR, not everything.
    (Ian Clatworthy)
++* Fetch between repositories does not error if they have inconsistent data
++  that should be irrelevant to the fetch operation. (Aaron Bentley)
++
  * Progress bars are now suppressed again when the environment variable
    ``BZR_PROGRESS_BAR`` is set to ``none``.
    (Martin Pool, #339385)
 === modified file 'bzrlib/groupcompress.py'
 --- bzrlib/groupcompress.py	2009-06-23 15:27:50 +0000
 +++ bzrlib/groupcompress.py	2009-06-29 15:35:12 +0000
@@ -942,7 +942,7 @@
          self.endpoint = endpoint
--def make_pack_factory(graph, delta, keylength):
++def make_pack_factory(graph, delta, keylength, inconsistency_fatal=True):
      """Create a factory for creating a pack based groupcompress.
      This is only functional enough to run interface tests, it doesn't try to
@@ -963,7 +963,8 @@
          writer = pack.ContainerWriter(stream.write)
          writer.begin()
          index = _GCGraphIndex(graph_index, lambda:True, parents=parents,
--            add_callback=graph_index.add_nodes)
++            add_callback=graph_index.add_nodes,
++            inconsistency_fatal=inconsistency_fatal)
          access = knit._DirectPackAccess({})
          access.set_writer(writer, graph_index, (transport, 'newpack'))
          result = GroupCompressVersionedFiles(index, access, delta)
@@ -1610,7 +1611,8 @@
      """Mapper from GroupCompressVersionedFiles needs into GraphIndex storage."""
      def __init__(self, graph_index, is_locked, parents=True,
--        add_callback=None, track_external_parent_refs=False):
++        add_callback=None, track_external_parent_refs=False,
++        inconsistency_fatal=True):
          """Construct a _GCGraphIndex on a graph_index.
          :param graph_index: An implementation of bzrlib.index.GraphIndex.
@@ -1624,12 +1626,17 @@
          :param track_external_parent_refs: As keys are added, keep track of the
              keys they reference, so that we can query get_missing_parents(),
              etc.
++        :param inconsistency_fatal: When asked to add records that are already
++            present, and the details are inconsistent with the existing
++            record, raise an exception instead of warning (and skipping the
++            record).
          """
          self._add_callback = add_callback
          self._graph_index = graph_index
          self._parents = parents
          self.has_graph = parents
          self._is_locked = is_locked
++        self._inconsistency_fatal = inconsistency_fatal
          if track_external_parent_refs:
              self._key_dependencies = knit._KeyRefs()
          else:
@@ -1671,8 +1678,14 @@
              present_nodes = self._get_entries(keys)
              for (index, key, value, node_refs) in present_nodes:
                  if node_refs != keys[key][1]:
--                    raise errors.KnitCorrupt(self, "inconsistent details in add_records"
--                        ": %s %s" % ((value, node_refs), keys[key]))
++                    details = '%s %s %s' % (key, (value, node_refs), keys[key])
++                    if self._inconsistency_fatal:
++                        raise errors.KnitCorrupt(self, "inconsistent details"
++                                                 " in add_records: %s" %
++                                                 details)
++                    else:
++                        trace.warning("inconsistent details in skipped"
++                                      " record: %s", details)
                  del keys[key]
                  changed = True
          if changed:
 === modified file 'bzrlib/repofmt/groupcompress_repo.py'
 --- bzrlib/repofmt/groupcompress_repo.py	2009-06-26 09:24:34 +0000
 +++ bzrlib/repofmt/groupcompress_repo.py	2009-06-29 15:35:12 +0000
@@ -622,7 +622,8 @@
          self.inventories = GroupCompressVersionedFiles(
              _GCGraphIndex(self._pack_collection.inventory_index.combined_index,
                  add_callback=self._pack_collection.inventory_index.add_callback,
--                parents=True, is_locked=self.is_locked),
++                parents=True, is_locked=self.is_locked,
++                inconsistency_fatal=False),
              access=self._pack_collection.inventory_index.data_access)
          self.revisions = GroupCompressVersionedFiles(
              _GCGraphIndex(self._pack_collection.revision_index.combined_index,
@@ -634,19 +635,22 @@
          self.signatures = GroupCompressVersionedFiles(
              _GCGraphIndex(self._pack_collection.signature_index.combined_index,
                  add_callback=self._pack_collection.signature_index.add_callback,
--                parents=False, is_locked=self.is_locked),
++                parents=False, is_locked=self.is_locked,
++                inconsistency_fatal=False),
              access=self._pack_collection.signature_index.data_access,
              delta=False)
          self.texts = GroupCompressVersionedFiles(
              _GCGraphIndex(self._pack_collection.text_index.combined_index,
                  add_callback=self._pack_collection.text_index.add_callback,
--                parents=True, is_locked=self.is_locked),
++                parents=True, is_locked=self.is_locked,
++                inconsistency_fatal=False),
              access=self._pack_collection.text_index.data_access)
          # No parents, individual CHK pages don't have specific ancestry
          self.chk_bytes = GroupCompressVersionedFiles(
              _GCGraphIndex(self._pack_collection.chk_index.combined_index,
                  add_callback=self._pack_collection.chk_index.add_callback,
--                parents=False, is_locked=self.is_locked),
++                parents=False, is_locked=self.is_locked,
++                inconsistency_fatal=False),
              access=self._pack_collection.chk_index.data_access)
          # True when the repository object is 'write locked' (as opposed to the
          # physical lock only taken out around changes to the pack-names list.)
 === modified file 'bzrlib/tests/test_config.py'
 --- bzrlib/tests/test_config.py	2009-06-10 03:56:49 +0000
 +++ bzrlib/tests/test_config.py	2009-06-29 15:35:12 +0000
@@ -18,7 +18,6 @@
  """Tests for finding and reading the bzr config file[s]."""
  # import system imports here
  from cStringIO import StringIO
--import getpass
  import os
  import sys
 === modified file 'bzrlib/tests/test_groupcompress.py'
 --- bzrlib/tests/test_groupcompress.py	2009-06-22 18:30:08 +0000
 +++ bzrlib/tests/test_groupcompress.py	2009-06-29 15:35:12 +0000
@@ -25,6 +25,7 @@
      index as _mod_index,
      osutils,
      tests,
++    trace,
      versionedfile,
+     )
  from bzrlib.osutils import sha_string
@@ -474,11 +475,12 @@
  class TestCaseWithGroupCompressVersionedFiles(tests.TestCaseWithTransport):
      def make_test_vf(self, create_graph, keylength=1, do_cleanup=True,
--                     dir='.'):
++                     dir='.', inconsistency_fatal=True):
          t = self.get_transport(dir)
          t.ensure_base()
          vf = groupcompress.make_pack_factory(graph=create_graph,
--            delta=False, keylength=keylength)(t)
++            delta=False, keylength=keylength,
++            inconsistency_fatal=inconsistency_fatal)(t)
          if do_cleanup:
              self.addCleanup(groupcompress.cleanup_pack_group, vf)
          return vf
@@ -658,6 +660,47 @@
              frozenset([('parent-1',), ('parent-2',)]),
              index.get_missing_parents())
++    def make_source_with_b(self, a_parent, path):
++        source = self.make_test_vf(True, dir=path)
++        source.add_lines(('a',), (), ['lines\n'])
++        if a_parent:
++            b_parents = (('a',),)
++        else:
++            b_parents = ()
++        source.add_lines(('b',), b_parents, ['lines\n'])
++        return source
++
++    def do_inconsistent_inserts(self, inconsistency_fatal):
++        target = self.make_test_vf(True, dir='target',
++                                   inconsistency_fatal=inconsistency_fatal)
++        for x in range(2):
++            source = self.make_source_with_b(x==1, 'source%s' % x)
++            target.insert_record_stream(source.get_record_stream(
++                [('b',)], 'unordered', False))
++
++    def test_inconsistent_redundant_inserts_warn(self):
++        """Should not insert a record that is already present."""
++        warnings = []
++        def warning(template, args):
++            warnings.append(template % args)
++        _trace_warning = trace.warning
++        trace.warning = warning
++        try:
++            self.do_inconsistent_inserts(inconsistency_fatal=False)
++        finally:
++            trace.warning = _trace_warning
++        self.assertEqual(["inconsistent details in skipped record: ('b',)"
++                          " ('42 32 0 8', ((),)) ('74 32 0 8', ((('a',),),))"],
++                         warnings)
++
++    def test_inconsistent_redundant_inserts_raises(self):
++        e = self.assertRaises(errors.KnitCorrupt, self.do_inconsistent_inserts,
++                              inconsistency_fatal=True)
++        self.assertContainsRe(str(e), "Knit.* corrupt: inconsistent details"
++                              " in add_records:"
++                              " \('b',\) \('42 32 0 8', \(\(\),\)\) \('74 32"
++                              " 0 8', \(\(\('a',\),\),\)\)")
++
  class TestLazyGroupCompress(tests.TestCaseWithTransport):
 === modified file 'bzrlib/tests/test_repository.py'
 --- bzrlib/tests/test_repository.py	2009-06-26 09:24:34 +0000
 +++ bzrlib/tests/test_repository.py	2009-06-29 15:35:12 +0000
@@ -797,6 +797,14 @@
          self.assertEqual(257, len(full_chk_records))
          self.assertSubset(simple_chk_records, full_chk_records)
++    def test_inconsistency_fatal(self):
++        repo = self.make_repository('repo', format='2a')
++        self.assertTrue(repo.revisions._index._inconsistency_fatal)
++        self.assertFalse(repo.texts._index._inconsistency_fatal)
++        self.assertFalse(repo.inventories._index._inconsistency_fatal)
++        self.assertFalse(repo.signatures._index._inconsistency_fatal)
++        self.assertFalse(repo.chk_bytes._index._inconsistency_fatal)
++
  class TestKnitPackStreamSource(tests.TestCaseWithMemoryTransport):
 === modified file 'bzrlib/tests/test_ui.py'
 --- bzrlib/tests/test_ui.py	2009-06-17 05:16:48 +0000
 +++ bzrlib/tests/test_ui.py	2009-06-29 15:35:12 +0000
@@ -23,21 +23,15 @@
  import sys
  import time
--import bzrlib
--import bzrlib.errors as errors
++from bzrlib import (
++    errors,
++    tests,
++    ui as _mod_ui,
++    )
  from bzrlib.symbol_versioning import (
      deprecated_in,
+     )
--from bzrlib.tests import (
--    TestCase,
--    TestUIFactory,
--    StringIOWrapper,
--    )
  from bzrlib.tests.test_progress import _TTYStringIO
--from bzrlib.ui import (
--    CLIUIFactory,
--    SilentUIFactory,
--    )
  from bzrlib.ui.text import (
      NullProgressView,
      TextProgressView,
@@ -45,10 +39,10 @@
+     )
--class UITests(TestCase):
++class UITests(tests.TestCase):
      def test_silent_factory(self):
--        ui = SilentUIFactory()
++        ui = _mod_ui.SilentUIFactory()
          stdout = StringIO()
          self.assertEqual(None,
                           self.apply_redirected(None, stdout, stdout,
@@ -62,8 +56,9 @@
          self.assertEqual('', stdout.getvalue())
      def test_text_factory_ascii_password(self):
--        ui = TestUIFactory(stdin='secret\n', stdout=StringIOWrapper(),
--                           stderr=StringIOWrapper())
++        ui = tests.TestUIFactory(stdin='secret\n',
++                                 stdout=tests.StringIOWrapper(),
++                                 stderr=tests.StringIOWrapper())
          pb = ui.nested_progress_bar()
          try:
              self.assertEqual('secret',
@@ -84,9 +79,9 @@
          We can't predict what encoding users will have for stdin, so we force
          it to utf8 to test that we transport the password correctly.
          """
--        ui = TestUIFactory(stdin=u'baz\u1234'.encode('utf8'),
--                           stdout=StringIOWrapper(),
--                           stderr=StringIOWrapper())
++        ui = tests.TestUIFactory(stdin=u'baz\u1234'.encode('utf8'),
++                                 stdout=tests.StringIOWrapper(),
++                                 stderr=tests.StringIOWrapper())
          ui.stderr.encoding = ui.stdout.encoding = ui.stdin.encoding = 'utf8'
          pb = ui.nested_progress_bar()
          try:
@@ -194,11 +189,11 @@
          self.assertEqual('', factory.stdin.readline())
      def test_silent_ui_getbool(self):
--        factory = SilentUIFactory()
++        factory = _mod_ui.SilentUIFactory()
          self.assert_get_bool_acceptance_of_user_input(factory)
      def test_silent_factory_prompts_silently(self):
--        factory = SilentUIFactory()
++        factory = _mod_ui.SilentUIFactory()
          stdout = StringIO()
          factory.stdin = StringIO("y\n")
          self.assertEqual(True,
@@ -222,7 +217,8 @@
      def test_text_factory_prompts_and_clears(self):
          # a get_boolean call should clear the pb before prompting
          out = _TTYStringIO()
--        factory = TextUIFactory(stdin=StringIO("yada\ny\n"), stdout=out, stderr=out)
++        factory = TextUIFactory(stdin=StringIO("yada\ny\n"),
++                                stdout=out, stderr=out)
          pb = factory.nested_progress_bar()
          pb.show_bar = False
          pb.show_spinner = False
@@ -253,7 +249,7 @@
              pb.finished()
      def test_silent_ui_getusername(self):
--        factory = SilentUIFactory()
++        factory = _mod_ui.SilentUIFactory()
          factory.stdin = StringIO("someuser\n\n")
          factory.stdout = StringIO()
          factory.stderr = StringIO()
@@ -279,8 +275,9 @@
          self.assertEqual('', factory.stdin.readline())
      def test_text_ui_getusername_utf8(self):
--        ui = TestUIFactory(stdin=u'someuser\u1234'.encode('utf8'),
--                           stdout=StringIOWrapper(), stderr=StringIOWrapper())
++        ui = tests.TestUIFactory(stdin=u'someuser\u1234'.encode('utf8'),
++                                 stdout=tests.StringIOWrapper(),
++                                 stderr=tests.StringIOWrapper())
          ui.stderr.encoding = ui.stdout.encoding = ui.stdin.encoding = "utf8"
          pb = ui.nested_progress_bar()
          try:
@@ -295,7 +292,7 @@
              pb.finished()
--class TestTextProgressView(TestCase):
++class TestTextProgressView(tests.TestCase):
      """Tests for text display of progress bars.
      """
      # XXX: These might be a bit easier to write if the rendering and