John A Meinel wrote:
> Review: Needs Fixing
> To start with, I think your layering here is pretty nice. Adding an object
> does make managing the state a bit clearer. get_record_stream() is complex
> enough that it probably should have started out as an object interface rather
> than just a simple generator. Oh well.

Yes. I'm glad I structured it this way; it feels quite nice to me once I
figured out which bits to move out of _get_remaining_record_stream.

> In short, I think we have a race condition because of how LRUSizeCache (or
> just LRUCache) interacts with _get_blocks() being in arbitrary ordering. And
> we need a different caching algorithm that can ensure requested blocks are
> never flushed from the cache before they are no longer needed.

Ok, I see that. I've fixed it by making add_key grab and keep cached data
immediately; we know we're about to use it, so it would be perverse to allow
it to fall out of LRUCache and then not be used. This shifts some of the
"plan what to get" logic out of yield_factories to add_key, which I was
considering doing anyway.

> The comment here is no longer correct:
> +        for read_memo in read_memos:
> +            try:
> +                yield cached[read_memo]
> +            except KeyError:
> +                # read the group
> +                zdata = raw_records.next()
> +                # decompress - whole thing - this is not a bug, as it
> +                # permits caching. We might want to store the partially
> +                # decompresed group and decompress object, so that recent
> +                # texts are not penalised by big groups.
> +                block = GroupCompressBlock.from_bytes(zdata)
> +                self._group_cache[read_memo] = block
> +                cached[read_memo] = block
> +                yield block
>
> ^- We are caching the 'block' and not the raw content anymore. And the block
> may only partially evaluate the compressed content.

Ok. I've deleted that large, wrong comment and just put this at the top of
the except block:

    # Read the block, and cache it.

> However, I'm more concerned that in this loop: [...]
>
> ^- There is an assumption that raw_records is in the exact ordering of
> 'read_memos'.

I've added the assertion you suggest below (without an 'assert' statement, of
course).

> And I think there is another small assumption in here, which is that the code
> that filters out duplicates is going to require that groups always perfectly
> fit in cache and aren't expired before you get to the duplicate. Perhaps an
> example: [...]
> Even more worrisome, is "large groups" which may never get put into the cache
> in the first place. (LRUSizeCache says "if a request is larger than my size,
> don't cache it".)
>
> I think we'll never run into these bugs in test data, but we'll see them 'in
> the wild' once we have data that may not fit well in the cache.

Yeah, I think you're right. I wish that we could make the test data encounter
these cases as much as wild data does...

> So I think what we really need is a different caching logic. Namely that
> "_get_blocks()" could keep a counter for how many times it needs a given
> block, and doesn't flush the block out of the cache until that value is
> satisfied. In addition, it is probably reasonable to continue to cache things
> in gcvf._cache. I'm not sure about the LRU effects, and whether we should
> always make a request on the _cache to keep it in sync with the actual
> requests...

What I've done, as mentioned above, is to retrieve and keep the block on
_BatchingBlockFetcher at add_key time. As new blocks are received from
_get_blocks they are also stored on _BatchingBlockFetcher.
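In rough outline, the add_key half of that looks something like this (just a
sketch to show the shape; 'batch_blocks' and the flattened 'locations'
mapping are illustrative, not the names used in the actual diff):

    class _BatchingBlockFetcher(object):
        """Gathers keys into a batch and pins their blocks while it runs."""

        def __init__(self, gcvf, locations):
            self.gcvf = gcvf
            # Sketch simplification: map each key straight to
            # (read_memo, group_size); the real code derives these from the
            # index_memo it already has for the key.
            self.locations = locations
            self.keys = []
            self.total_bytes = 0
            # Blocks held for the duration of this batch, keyed by
            # read_memo.  Once a block lands here it cannot fall out of the
            # LRU cache before yield_factories() has finished with it.
            self.batch_blocks = {}

        def add_key(self, key):
            """Queue key; pin its block now if it is already cached."""
            self.keys.append(key)
            read_memo, group_size = self.locations[key]
            if read_memo in self.batch_blocks:
                # Another key in this batch already pinned this block.
                return self.total_bytes
            try:
                # Grab the cached block immediately, before the LRU can
                # evict it.
                self.batch_blocks[read_memo] = \
                    self.gcvf._group_cache[read_memo]
            except KeyError:
                # Not cached: it will be fetched via _get_blocks(), so its
                # size counts towards the batch total.
                self.total_bytes += group_size
            # Returning the running total lets the caller decide when the
            # batch is big enough (see below).
            return self.total_bytes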
So each block, once retrieved, whether via cache or _get_blocks, will then be
present for the rest of the batch (i.e. until the end of yield_factories).
This fixes that issue, I think. An improvement to this scheme would be to
then forget blocks from the batch once we know nothing else in the batch will
use them (e.g. by tracking how many keys in the batch need a particular
block/read_memo), but that's probably not very important.

> Then again, I think the "batching cache" is really all the caching we really
> need. So I would be tempted to make that the only cache, and see what impact
> it has. (Do we need caching that persists between get_record_stream calls? We
> might, but things like CHKMap has its own cache...)

I wonder! I'm willing to experiment, but for now I'd rather leave that cache
there. It seems probable to me that it is having at least some minor
beneficial effect, and I'd rather not throw that away.

> Another possibility would be to use the cache as-is, but a cache miss
> re-requests the block. However, I don't think we can make a new request while
> we have a read pending, so that is probably much more complex than just
> changing the caching logic.

Right, I think so too.

> +        # Batch up as many keys as we can until either:
> +        #  - we encounter an unadded ref, or
> +        #  - we run out of keys, or
> +        #  - the total bytes to retrieve for this batch > 256k
> +        batcher = _BatchingBlockFetcher(self, locations)
> +        BATCH_SIZE = 2**18
>
> ^- BATCH_SIZE should at a minimum be a global. I'm not sure why you chose 256
> vs 64 vs... nor why you don't use something like
> transport.recommended_page_size() (4k locally, 64k remote).

It was an arbitrary value plucked out of the air. I'll set it globally to
64k. I'm not sure how to cleanly get at a transport from here. In my
experiments the value didn't really matter much; the main thing was to avoid
fetching 300 bytes at a time!

> going further, though. Is that we aren't paying attention to the 'index' value
> of the groups when we consider how we want to batch.
>
> Specifically, you don't gain anything by batching index1 with index2, because
> that requires reading 2 files, which requires 2 round trips. Note that
> get_raw_records() does split things yet again into index-based batches.

Right. Perhaps we should track the batch size per index? Or perhaps it
doesn't matter a lot. We can tweak this in the future pretty easily, anyway.

> Now I suppose if my patch for bug #402645 goes through, then we will be
> fetching 'unordered' and the _get_io_ordered_source_keys will group things by
> index, and that should help a bit there.

Right, I'm hoping so too :)

> +                    if batcher.total_bytes > BATCH_SIZE:
> +                        # Ok, this batch is big enough. Yield some results.
> +                        for _ in batcher.yield_factories():
> +                            yield _
>
> ^- I personally don't like to see "_" as a variable that ever gets used on
> the right hand side.
> I'd prefer:
>   for factory in batcher.yield_factories():
>       yield factory

Ok, changed. (Although I find that the longer lines, with the word 'factory'
repeated, negate the readability benefit for me, so it's much of a muchness;
I'm happy to go with your preference.)

> I would like to see tests added as part of this. Possibly direct tests of
> _BatchingBlockFetcher

Me too. Unfortunately _BatchingBlockFetcher is still a bit too coupled to
GroupCompressedVersionedFiles to make for good unit tests. I'll see what I
can do, but I think that'll be in a followup patch, if that's ok with you.

> Ideally we would have some tests that we do, in fact, read in big batches so
> that we don't accidentally regress. It is something that isn't data-wise
> incorrect, but can have a large impact on performance for certain users. So it
> is something that is a bit hard to detect, and requires someone going
> "something doesn't seem right".
>
> That said it is probably pretty hard to test accurately, and it would be a
> shame to miss out on the benefit for a long time while waiting for testing.

Perhaps usertest can help?

> Why not have "add_key()" return the number of total bytes so far. so this
> becomes:
> +                    batcher.add_key(key)
> +                    if batcher.total_bytes > BATCH_SIZE:
> +                        # Ok, this batch is big enough. Yield some results.
> +                        for _ in batcher.yield_factories():
> +                            yield _
>
>   if batcher.add_key(key) > BATCH_SIZE:
>       ...
>
> I'm just slightly hesitant about looking at object's attributes rather than
> function return values. Perhaps just overly so.

I've made this change. I agree that it's a bit nicer to rely on methods
rather than attributes for this object.

> I don't really like the idea that we are maintaining all the various read
> orders in various length lists that have to be kept in sync. [...]
> Maybe if we just check that block = blocks.next(); block.location == read_memo
> ?
> That might be sufficient. Just some sort of check that everything is still in
> lock-step as we expect it to be.
>
> Or even:
>   block_read_memo, block = blocks.next()
>   assert block_read_memo == read_memo

I've done essentially this.
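Concretely, _get_blocks now yields (read_memo, block) pairs and the consumer
checks each pair as it goes. Roughly like this (a sketch only; in the actual
diff the check is inline in yield_factories rather than a separate helper):

    def _check_next_block(blocks, read_memo):
        """Take the next block from blocks, checking it is the expected one.

        (Illustrative helper only -- the real check lives inline where the
        block is consumed.)
        """
        block_read_memo, block = blocks.next()
        if block_read_memo != read_memo:
            raise AssertionError(
                'block out of order: expected %r, got %r'
                % (read_memo, block_read_memo))
        return block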
Incremental diff attached.

-Andrew.