Zorba

Code review comment for lp:~zorba-coders/zorba/dataguide

dataguide
Merge into trunk

Nicolae Brinza (nbrinza) wrote on 2013-07-18:

> - I find the name dataguide misleading because it's a guide on the query and
> not on the data. Maybe QueryPruneGuide would be more meaningful

The query itself is not pruned; the data is. I think "dataguide" is the established term -- see for example this paper: http://ilpubs.stanford.edu:8090/264/1/1997-50.pdf

> - Can the user also use the zann_explores_json annotation?

Yes, users can use it as well, but I am not sure it makes sense for them to do so. An external function is automatically handled as if it had the annotation, and for a UDF it doesn't really make sense to add it.

> - Why is the dataguide parameter on the Store's getCollection() function?
> Shouldn't it be on the function that returns the iterator? The problem is that
> a Collection object within the simplestore exists only once per collection.
> What's the semantics if multiple queries access the collection (possibly in
> parallel)?

It very much depends on how the collections are handled. Currently for Zorba collections it doesn't make sense to have any dataguides at all, because they're in-memory collections. I have not taken a look at the Sausalito code and have not seen how e.g. the MongoDB "collections" are managed. getCollection() seemed the most logical place where it should be passed, but the dataguide parameter could be easily propagated to any Store class, including the function that returns the iterator.

Currently each and every db:collection() call has its own dataguide, even if several calls refer to the same collection. If the collection manager currently "caches" or reuses the collection iterators, then it might make sense to forbid that, so that the dataguide for each individual db:collection() call can be used.

Alternatively, a "union" of the dataguides that refer to the same collection could be performed. But I think it is not always possible to determine whether two calls refer to the same collection.
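To make the "union" idea concrete, here is a minimal sketch in Python. It is not the code in the branch: it assumes a dataguide can be modelled as a nested dict of object keys, and the ALL marker and the union() helper are hypothetical names introduced only for this illustration.

```python
# Hypothetical model: a dataguide is either ALL (the whole value is needed)
# or a dict mapping an object key to the dataguide of the value under it.
ALL = "*"


def union(a, b):
    """Return a dataguide that keeps everything either input keeps."""
    # If either side needs the whole value, the union needs the whole value.
    if a == ALL or b == ALL:
        return ALL
    merged = dict(a)  # shallow copy; the inputs are never mutated
    for key, sub in b.items():
        merged[key] = union(merged[key], sub) if key in merged else sub
    return merged


# Two db:collection() calls on the same collection, each with its own guide.
guide_1 = {"name": ALL, "address": {"city": ALL}}
guide_2 = {"address": {"zip": ALL}, "age": ALL}
combined = union(guide_1, guide_2)
```

With such a union, a shared collection iterator would have to keep every field that any of the calls needs, so it prunes less than the individual guides would.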

I think this can be investigated and decided when implementing the dataguide push-down into MongoDB, or when I take a closer look at Sausalito's collection manager code.

> - Did you measure the performance impact of the optimizer on some larger
> queries?

The expression tree is traversed exactly once, visiting each node, so the cost should be comparable to any other dataflow computation, e.g. the "ignores sorts/order" analysis. If there are no "sources", i.e. db:collection() or jn:parse() calls, the dataguide computation just propagates NULLs, doing no calculations and almost no memory allocations (at most one dataguide_cb allocation per fo_expr, plus several others). If there are "sources" in the tree, union operations will be performed for some of the nodes.
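The shape of that single pass can be sketched as follows. This is a simplified model in Python, not the actual optimizer code: the Expr class, its three node kinds, and the path-set representation are all assumptions made for the illustration, and a "lookup" node is assumed to have a single input child.

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    kind: str                 # "source", "lookup", or "other" (hypothetical kinds)
    name: str = ""            # label of the source call, or the key looked up
    children: list = field(default_factory=list)


def compute(expr, stats):
    """Bottom-up pass: return {source: set of key paths}, or None if the
    subtree contains no source."""
    stats["visited"] += 1     # every node is counted exactly once
    child_guides = [compute(child, stats) for child in expr.children]
    if expr.kind == "source":
        return {expr.name: {()}}
    live = [g for g in child_guides if g is not None]
    if not live:
        return None           # no sources below: just propagate "NULL"
    if expr.kind == "lookup":
        # Extend every path of the input with the looked-up key.
        return {src: {path + (expr.name,) for path in paths}
                for src, paths in live[0].items()}
    merged = {}               # any other node: union of the children's guides
    for guide in live:
        for src, paths in guide.items():
            merged.setdefault(src, set()).update(paths)
    return merged


with_source = Expr("other", children=[
    Expr("lookup", "a", [Expr("source", "c1")]),
    Expr("lookup", "b", [Expr("source", "c1")]),
    Expr("other"),
])
stats = {"visited": 0}
guide = compute(with_source, stats)
```

In this model the union work happens only at nodes that have at least one source underneath them; everything else returns None immediately.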

I will check if any of our larger queries have longer compilation times, but because none of them have db:collection() or jn:parse() calls, I do not expect any differences.

It would make sense to have a specially constructed query that stress-tests the dataguide code, e.g. db:collection().navigation.navigation. ... .navigation with the navigation step repeated several thousand times, or something similar. I will try that out and see whether it slows down the compilation.
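Such a query is easiest to generate with a small script. The sketch below is only an illustration: build_chain_query() is a hypothetical helper, and the collection name "stress" and the argument form of db:collection() are placeholders that would need to match the real setup.

```python
def build_chain_query(depth, collection="stress", key="navigation"):
    """Build db:collection("...") followed by `depth` object lookups."""
    # Collection name and key are placeholders, not names from the branch.
    return 'db:collection("%s")' % collection + (".%s" % key) * depth


small = build_chain_query(3)
large = build_chain_query(5000)
```

The generated text can be saved to a file and compiled with Zorba, comparing the compilation time against a query without the chain.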