Code review comment for lp:~zorba-coders/zorba/dataguide

Matthias Brantner (matthias-brantner) wrote :

> > - I find the name dataguide misleading because it's a guide on the query and
> > not on the data. Maybe QueryPruneGuide would be more meaningful
>
> The query itself is not pruned, the data is. I think "dataguide" is the
> established term -- see for example this paper:
> http://ilpubs.stanford.edu:8090/264/1/1997-50.pdf .
"DataGuides serve as dynamic schemas, generated from the database." What we generate is a schema from the query.

> > - Why is the dataguide parameter on the Store's getCollection() function?
> > Shouldn't it be on the function that returns the iterator? The problem is that
> > a Collection object within the simplestore exists only once per collection.
> > What's the semantics if multiple queries access the collection (possibly in
> > parallel)?
>
> It very much depends on how the collections are handled. Currently for Zorba
> collections it doesn't make sense to have any dataguides at all, because
> they're in-memory collections. I have not taken a look at the Sausalito code
> and have not seen how e.g. the MongoDB "collections" are managed.
> getCollection() seemed the most logical place where it should be passed, but
> the dataguide parameter could be easily propagated to any Store class,
> including the function that returns the iterator.
>
> Currently each and every db:collection() call has its own dataguide, even if
> they might refer to the same collection. If the collection manager currently
> "caches" or reuses the collection iterators, then it might make sense to
> forbid that so that the dataguide for each individual db:collection call could
> be used.
>
> Or alternatively, a "union" on the dataguides that refer to the same
> collection could be performed. But I think it is not always possible to
> determine if that is the case.
>
> I think this could be investigated and decided upon when implementing the
> Dataguide push-down into MongoDB or when I take a closer look at
> Sausalito's collection manager code.
I think we will run into a problem here. 28msec has only one buffer, which is shared by all db:collection() calls in a query. Hence, the dataguide applied to that buffer needs to be the union of the dataguides of all those calls.
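
To make the union concrete, here is a minimal sketch of what I have in mind. It assumes a dataguide can be modelled as a tree of path steps in which a node is either needed as a whole or only through some of its children. GuideNode, addPath and unionInto are hypothetical names for illustration only; they are not the classes in this branch or in the store.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical dataguide node (an assumption, not Zorba's actual class).
struct GuideNode {
  // true: the query needs everything at and below this node
  bool wholeSubtree = false;
  // needed children, keyed by field or element name
  std::map<std::string, std::unique_ptr<GuideNode> > children;
};

// Record that one query path (e.g. customer/name) is needed.
void addPath(GuideNode& root, const std::vector<std::string>& steps) {
  GuideNode* node = &root;
  for (const std::string& step : steps) {
    if (node->wholeSubtree) return;  // already covered by a shorter path
    std::unique_ptr<GuideNode>& slot = node->children[step];
    if (!slot) slot.reset(new GuideNode());
    node = slot.get();
  }
  node->wholeSubtree = true;  // the last step is needed in full
  node->children.clear();
}

// Merge `other` into `target` so that `target` keeps everything
// that either dataguide needs.
void unionInto(GuideNode& target, const GuideNode& other) {
  if (target.wholeSubtree) return;  // target already keeps everything here
  if (other.wholeSubtree) {
    target.wholeSubtree = true;     // the wider requirement wins
    target.children.clear();
    return;
  }
  for (const auto& entry : other.children) {
    std::unique_ptr<GuideNode>& slot = target.children[entry.first];
    if (!slot) slot.reset(new GuideNode());
    unionInto(*slot, *entry.second);
  }
}
```

With this model, each db:collection() call would still build its own dataguide, and the shared buffer would be pruned with the result of calling unionInto over all of them. The union can only keep more data than any single call needs, never less, so it is safe for every call that reads the buffer.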
