Merge lp:~adeuring/launchpad/bug-1020443-2 into lp:launchpad
Status: | Merged |
---|---|
Approved by: | Abel Deuring |
Approved revision: | no longer in the source branch. |
Merged at revision: | 15696 |
Proposed branch: | lp:~adeuring/launchpad/bug-1020443-2 |
Merge into: | lp:launchpad |
Prerequisite: | lp:~adeuring/launchpad/bug-1020443 |
Diff against target: |
1030 lines (+446/-81) 19 files modified
database/schema/patch-2209-24-3.sql (+124/-0) lib/lp/answers/doc/faq-vocabulary.txt (+1/-3) lib/lp/answers/doc/faq.txt (+1/-1) lib/lp/answers/model/faq.py (+1/-1) lib/lp/answers/model/question.py (+18/-6) lib/lp/answers/stories/this-is-a-faq.txt (+2/-2) lib/lp/bugs/doc/bugtask-find-similar.txt (+1/-0) lib/lp/bugs/doc/bugtask-search.txt (+58/-2) lib/lp/bugs/model/bugtasksearch.py (+5/-2) lib/lp/bugs/model/tests/test_bugtasksearch.py (+4/-2) lib/lp/registry/browser/product.py (+1/-1) lib/lp/registry/doc/vocabularies.txt (+11/-1) lib/lp/registry/model/person.py (+8/-7) lib/lp/registry/tests/test_person_vocabularies.py (+25/-1) lib/lp/registry/tests/test_personset.py (+53/-2) lib/lp/registry/tests/test_product_vocabularies.py (+10/-0) lib/lp/registry/tests/test_projectgroup_vocabulary.py (+26/-0) lib/lp/registry/vocabularies.py (+24/-20) lib/lp/services/database/doc/textsearching.txt (+73/-30) |
To merge this branch: | bzr merge lp:~adeuring/launchpad/bug-1020443-2 |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
j.c.sackett (community) | Approve | ||
Stuart Bishop | Pending | ||
Review via email: mp+116876@code.launchpad.net |
Commit message
Ignore the symbols "&|!" in full text searches; don't treat a leading '-' as the "NOT" operator. Use correct tsquery expressions for search result ranking.
Description of the change
This branch fixes numerous test failures which were triggered by the
changes from lp:~adeuring/launchpad/bug-1020443 (already approved).
In this branch, I dropped the questionble feature that the symbols
"&", "|", "!" are treaded as logical operators in full text search
queries.
This change had more side effects han I expected. DUring my work on
the fix I noticed two interesting bugs:
- the ranking of search results was sometimes wrong
- the words used in full text searches were sometimes stemmed twice
-- and the stemming algorithm from the Postgres text search
implementation is not idempotent, for certain words. Two examples
I noticed in the tests:
stemmed(
stemmed(
stemmed(
stemmed(
When a text is indexed after a change, the indexer stores the stem of
the words in the FTI; when a full text search expression is processed,
the procedure to_tsquery(
tokens, and if a token is treated as a word, it is stemmed. Finally,
the stemmed search words are looked up in an FTI column.
So, if we have the word "cheese" in an indexed text, "chees" is stored
in the FTI. During a regular search for "cheese", "cheese" is passed
(via the procedure ftq()) to to_tsquery(), and to_tsquery returns a
tsquery object containing the word "chees".
The broken searches call the function nl_phrase_search(), which issue
qn SQL query like "SELECT ftq('cheese')", which returns a string
representation of a tsquery. nl_phrase_search() extracts the different
stemmed words from the result, builds all subsets having one element
less than the original set of stemmd words, creates AND expressions
for the words of these subsets and finnaly returns an OR expression
of the AND expressions. In other words: It builds a query where at most
one word from the original search term may be missing in the searched
texts.
The string returned by nl_phrase_search() was often passed again to
ftq(). This has two side effects:
- ftq() calls to_tsquery() again, but now with the stemmed words,
leading to mappings like "cheese" -> "chees" -> "chee" -- and
the search will not return data having the original word "cheese".
And if "chee" instead if "chees" is used in the "sort by rank"
expression, we get the wrong sort order.
- ftq() now converts the symbols "&|!" to " ". (See the MP for
lp:~adeuring/launchpad/bug-1020443 for the reason of this change)
This means that the '|' is converted into an implicit AND, hence
the feature is dropped to find texts where one word from the search
string may be mssing.
notes about the changes in detail:
lib/lp/
ranking bug.
lib/lp/
'|' no longer treated equivalent to "OR"
lib/lp/
Avoid "double calls" of ftq(). fti_search is the result of an
nl_phrase_search() call; these results can be directly used
as tsquery expressions. If a query string is used like
'query&
the words are not stemmed. (See chapter 8.11.2 of the Postgres 9.1
documentation.)
test: ./bin/test -vvt lib/lp/
./bin/test -vvt lib/lp/
lib/lp/
class QuestionSearch is the base class a number of other classes.
Most of them store search phrases provided by a user in
self.search_text, with one exception: class SimilarQuestion
The class stores the result of nl_phrase_search() in self.search_text.
I added the flag self.nl_phrase_used to keep track of this difference.
The methods getConstraints() and getOrderByClause() now generate
SQL expressions that are useful both for "plain" search texts and
for texts processed by for nl_phrase_search().
test: ./bin/test -vvt lib/lp/
(no explicit additions or changes in this doc test, but the changes
described above fix failures.)
lib/lp/
Correct treatment of BugTaskSearchPa
from BugTaskSearchPa
search text as provided by a user, the former stores the result of
an nl_phrase_search() call. fast_searchtext is now treated as
described above. I had to change lib/lp/
The first change in this file (line 138) showed an improper use of
fast_searchtext: That property is supposed to contain only stemmed
words, which are always lower case. With my changes to
bugtasksearch.py, the example fails. I added a few lines to explain
the difference between the the BugTaskSearchParams properties.
The change in lib/lp/
search result -- is caused because the search word "cheese" is no
longer stemmed twice before being used in an SQL expression like
"table.fti @@ 'chees'::tsquery.
test: ./bin/test -vvt lib/lp/
./bin/test -vvt lib/lp/
./bin/test -vvt lp.bugs.
lib/lp/
Product.
OR combined. The method used the no longer working operator '|';
now changed to 'OR'.
test: ./bin/test -vvt lib/lp/
(yes -- testing this change only in a story is not strictly sufficient --
but the feature "search for any word" should have had it's own test before.
I can add a test if it's really necessary.)
lib/lp/
The methods find(text,...), findPerson(
class PersonSet called "text = text.lower()" before doing any further
processing. This breaks ftq() when the query includes the operators
AND, OR, NOT: They are only treated as operators if they are upper case.
Since these methods also use things like
EmailAddr
where search_text should of course be lower case, the code looks now a
bit more complicated. I added a few tests to test_personset.py to
check that the variants "plain search text" and "lower case search text"
are properly used.
test: ./bin/test -vvt test_personset.
./bin/test -vvt lib/lp/
(the changed methods are used in "vocabulary filtering".)
lib/lp/
Vocabularies have a method search(text, ...), where text is used in
SQL expressions like "Product.name LIKE 'text'" as well as
"Product.fti @@ ftq('text')".
These methods used, like Person.
case variant for both expressions. I changed this for the FTI query.
test: ./bin/test registry -vvt test_person_
./bin/test registry -vvt test_product_
./bin/test registry -vvt test_projectgro
Lint: Lots of messages about moin headers in doc tests and related
doc test complaints, but none of them are related to my changes. (OK,
except that I added a moin header for consistency's sake to
lib/lp/
Abel--
Looks alright. Two possible issues I'll leave to you to resolve before landing.
#710 you define like_query as query.lower(), then redefine it as another derivative of query; was the second line meant to be replaced, or further modify like_query instead of query?
#720 Same thing