On Thu, 2009-06-25 at 01:03 +0000, Aaron Bentley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Robert Collins wrote:
> > review -1
> >
> > There are two consequences of the line you've added:
> > - we'll lookup everything we are fetching in the target repository
>
> We were already looking up everything we were fetching in the target
> repository.

With a different access pattern, yes. However, that's arguably a bug: we
shouldn't need to look up CHK pages at all, as all their ids are
effectively random.

> > - incorrect data in one location will be sticky there and no warning
> > will be given
>
> We were already ignoring irrelevant incorrect data most of the time.
> With this change, we would ignore it all the time.

We were? Could you enlarge on this? We do ignore data that we already
have - is that what you mean?

> > From a performance point of view, looking up things we don't have is
> > pretty much worst case: we'll have to hit every index, and read to the
> > bottom page, most of the time, to show we don't have it. We got
> > significant performance wins by removing or batching similar calls in
> > knit.py over the last couple of years.
>
> When fetching the last 100 revisions of bzr.dev into a treeless branch,
> my patched version was actually faster, in our standard best-of-3 test:
>
> skip-dupes:
> real    0m4.431s
> user    0m3.808s
> sys     0m0.196s
>
> bzr.dev:
> real    0m4.439s
> user    0m3.788s
> sys     0m0.208s
>
> I conclude that pull times are noisy enough that the inefficiency in my
> version is not observable. Performance is not a concern.

It's not at that test's scale. However, larger pulls on larger trees may
well show different results; bzr itself is small on the scale of the
projects that are using bzr now.

> > Now, add_records in the same file *does* do a lookup for the same data,
> > but it does it in a batch, which means that we don't run the risk of
> > cache thrashing on large pulls, *and* it's controllable via the random_id
> > flag: we'll hit every backing index once and only once, and we don't pay
> > that overhead at all during commit.
>
> I'd love to do a batched lookup in
> GroupCompressVersionedFiles._insert_record_stream, but the API makes
> that very difficult. It would be nice if the API told us, possibly in
> batches, what records it was about to send.

Another way to structure this is to not error when add_records detects a
duplicate: just don't index it. The content will be redundant in the group
it's in, but never looked up from there. (And if the group is empty we
could discard it.) That would mean that only a single lookup happens, which
would address the performance concerns I have. (There is a rough sketch of
this at the end of this mail.)

> I'm happy to disable the dupe-skipping code when random_id is True.
>
> > I'd also _really_ prefer for us to just fix the data rather than working
> > around it in push and pull.
>
> I'm not really sure what you mean. In my case, my repository is
> reconciled, and I'm trying to pull correct data from
> lp:~launchpad-pqm/launchpad/devel. This bug prevented that, and no
> random launchpad developer can reasonably be expected to reconcile the
> master copy of lp:~launchpad-pqm/launchpad/devel.

The master copy was meant to be reconciled *before* migrating, and all
user repositories were also meant to be reconciled.

> Because push and pull are acting as a barrier, preventing pure and
> impure repositories from interoperating, we have to delay reconciling
> lp:~launchpad-pqm/launchpad/devel. If we were to reconcile it now, that
> could break all other launchpad devs until they, too, reconciled.
>
> Because push and pull act as a barrier and reconcile takes 3+ days to
> run, we would have to stop all PQM commits for 3+ days in order to
> reconcile lp:~launchpad-pqm/launchpad/devel. If we were to take a
> mirror of lp:~launchpad-pqm/launchpad/devel and reconcile that, we might
> not be able to merge in the commits that had happened while reconcile
> was running, even though those commits would be correct data.

check and reconcile performance are on the todo list as high-priority
items; right now the bugs with stacked branches are even higher, as they
prevent repositories that are consistent from interoperating.

> > The more we can work on the basis of the
> > data being correct, the leaner and faster our code can be.
>
> Yes, but this was not any kind of carefully-designed, systematic test.
> The only reason it happens is because we're not correctly determining
> which records to send. And it only happens when one side has
> reconciled. If nobody reconciles, no one notices a problem. So in a
> practical sense, it does not help us work on the basis of the data
> being correct.
>
> In any case, it appears the faulty revisions were committed with bzr
> 1.15, so flawed data is a reality with even very recent releases of
> bzr. We cannot depend on data being correct. We cannot break horribly
> when it is not.

I don't recall any changes to file parent graph calculations being made
from bzr 1.15 to 1.16. The only change I recall is that the rich-root
parent calculation was altered, but that was done before any of the
2a-supporting bzr releases, so it shouldn't be possible for reconcile to
give different answers there.

Another possibility is that the inconsistency check is actually buggy, or
that the revisions that were different were reconciled in a stacked
repository (where we don't correct or truncate data at the edge of the
graph) - have you investigated that at all?

-Rob
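
P.S. To make the add_records suggestion above concrete, here is a rough,
self-contained sketch of the behaviour I mean. It is illustrative only:
the names (SketchIndex, _present_keys) are invented, and this is not
bzrlib's actual index code.

# Illustrative sketch only, not bzrlib code.  add_records does one batched
# presence check per call and silently skips keys that are already indexed,
# instead of raising.  When the caller passes random_id=True (e.g. commit,
# where the ids are freshly generated), even that single lookup is skipped.

class SketchIndex(object):

    def __init__(self):
        # key -> value.  In a real pack index the value would point at a
        # group/offset; a dict is enough to show the behaviour.
        self._entries = {}

    def _present_keys(self, keys):
        # One batched query for the whole call; a real implementation would
        # hit each backing index at most once for the full key set.
        return set(keys).intersection(self._entries)

    def add_records(self, records, random_id=False):
        """Add (key, value) records, skipping keys that are already indexed.

        :param records: iterable of (key, value) pairs.
        :param random_id: the caller guarantees the keys cannot already be
            present, so the presence check is skipped entirely.
        :return: the set of keys that were skipped as duplicates.
        """
        records = list(records)
        if random_id:
            present = set()
        else:
            present = self._present_keys(key for key, value in records)
        for key, value in records:
            if key in present:
                # Duplicate: leave the existing entry alone.  The redundant
                # bytes stay in whatever group they were written to, but no
                # index entry points at them, so they are never read back.
                continue
            self._entries[key] = value
        return present

The effect is that a fetch pays for exactly one batched presence check per
add_records call, commit (random_id=True) pays nothing, and duplicates
simply stop being indexed rather than aborting the insertion.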