OpenStack Compute (nova)

Code review comment for lp:~cbehrens/nova/servers-search

servers-search
Merge into trunk

Brian Lamar (blamar) wrote on 2011-07-14:

> > > I recognize (and agree with your decision) not to do regexp matching via
> > > the database. Not only is it not portable, it's not any more efficient to
> > > do that at the database level (still requires a scan of all pre-restricted
> > > rows anyway...).
> >
> > Regular expressions are more expensive than LIKE matches (which in their own
> > right, are pretty expensive).
>
> Actually, this is incorrect. LIKE '%something%' and column REGEXP 'someregexp'
> will produce identical query execution plans. The complexity of the REGEXP
> determines whether or not a simple string match such as '%something%' would be
> computationally more expensive to execute per row than a compiled regexp
> match.
>
> > Do we really want operators doing complex
> > regexs? At that point we should be putting our data into a purpose-built
> > search indexing solution like Lucene/Solr/ElasticSearch/Sphinx because
> > that's what they're good at.
>
> Lucene/Solr/ElasticSearch/Sphinx are fulltext indexing technologies. What's
> happening here is looking for a particular pattern in a short string. The
> solution presented here is flexible enough to query for various IP(v6) and
> name patterns without having to set up a separate fulltext indexing server for
> this kind of thing, which I think would be overboard.
>
> I understand your concern about the regexp inefficiency. Just saying that it's
> not that much less efficient than doing a REGEXP or LIKE '%something%'
> expression in SQL. The same loop and match process is occurring in Python code
> versus C code. The problem is that not all DBs support the REGEXP operator...
>
> Just my two cents,
> -jay

Wow, just gonna have to agree to disagree then... I'm really positive it's going to be tons more efficient to do simple wildcard matching vs. fully-fledged perl-compatible regular expressions. I'm all about not putting specialized code into projects and I see this as being search-specialized code. I'm going to bow out now because I don't want to further this discussion on a merge-prop (sorry Chris!!).
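
For anyone who would rather measure this than argue it, here is a rough sketch of the two per-row strategies being compared: a plain substring test (roughly what LIKE '%something%' does) and a compiled regexp search. The sample names and function names are made up for illustration and are not from the branch; timings will vary with the pattern and the data, so no numbers are claimed here.

```python
import re
import timeit

# Hypothetical sample data; real instance names would come from nova.
NAMES = ["instance-%08x" % i for i in range(10000)]


def filter_substring(names, needle):
    """Rough equivalent of LIKE '%needle%': a substring test per row."""
    return [name for name in names if needle in name]


def filter_regexp(names, pattern):
    """Rough equivalent of REGEXP: compile once, then search each row."""
    regexp = re.compile(pattern)
    return [name for name in names if regexp.search(name)]


if __name__ == "__main__":
    # Both strategies scan every row; only the per-row cost differs.
    candidates = (
        ("substring", lambda: filter_substring(NAMES, "00ff")),
        ("regexp", lambda: filter_regexp(NAMES, "00ff")),
    )
    for label, func in candidates:
        print(label, timeit.timeit(func, number=100))
```

Both functions return the same rows for a literal pattern, so any difference is purely in how long each row takes to test.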

> 1) Some of the things I'm matching against are not stored in the database. IPv6 addresses and the
> instance 'name', in particular. So, I'd have to build search functions in python that can parse
> '%..%'. That.. or I'd end up supporting regular expressions for some queries and the sql 'like'
> format for others.

Yeah, I can't believe that v6 addresses aren't stored in the DB. :) Good point.
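
To make the '%..%' parsing in point 1 concrete, below is a minimal sketch of what a Python-side LIKE matcher could look like. It is not code from the branch: `like_to_regexp` is a hypothetical helper, and it assumes case-sensitive matching and a backslash escape character, both of which vary between SQL backends.

```python
import re


def like_to_regexp(pattern, escape="\\"):
    """Translate an SQL LIKE pattern into a compiled Python regexp.

    '%' matches any run of characters, '_' matches exactly one, and the
    escape character makes the following character literal.
    """
    parts = []
    chars = iter(pattern)
    for ch in chars:
        if ch == escape:
            # Take the next character literally; a trailing escape
            # character is treated as a literal escape character.
            parts.append(re.escape(next(chars, escape)))
        elif ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # LIKE matches the whole value, so anchor both ends. DOTALL lets the
    # wildcards span newlines as well.
    return re.compile("^" + "".join(parts) + r"\Z", re.DOTALL)


def filter_like(values, pattern):
    """Return the values that match the LIKE pattern."""
    regexp = like_to_regexp(pattern)
    return [value for value in values if regexp.match(value)]
```

For example, `filter_like(["fe80::1", "2001:db8::1"], "fe80%")` keeps only the first address. Note that the helper ends up building a regexp anyway, which is part of the argument for exposing regular expressions directly.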

> 2) Related to #1 in a way, what if the backend ends up not being SQL at all?

Heh, good luck with this one, but the point I'm trying to make is compatible with this as well. I use ElasticSearch with some of my CouchDB projects and there are tons of indexers for all types of datastores.

> 3) 'like' does a full table scan anyway and regular expressions are a little more flexible.

Absolutely, I just feel that we should be leaving the database to do the data-storage, a search-indexer to do searching, and Python to tie everything together.

Going to abstain from this merge prop, my apologies for cluttering it with conversations that most assuredly should have happened at the Blueprint level.

review: Abstain
