Merge lp:~adeuring/launchpad/bug-29713 into lp:launchpad
Status: | Merged |
---|---|
Approved by: | Abel Deuring |
Approved revision: | no longer in the source branch. |
Merged at revision: | 15464 |
Proposed branch: | lp:~adeuring/launchpad/bug-29713 |
Merge into: | lp:launchpad |
Diff against target: |
607 lines (+311/-117) 4 files modified
database/schema/patch-2209-24-1.sql (+125/-0) lib/lp/answers/stories/question-browse-and-search.txt (+1/-1) lib/lp/registry/doc/vocabularies.txt (+1/-2) lib/lp/services/database/doc/textsearching.txt (+184/-114) |
To merge this branch: | bzr merge lp:~adeuring/launchpad/bug-29713 |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Raphaël Badin (community) | Approve | ||
Stuart Bishop (community) | db | Approve | |
Francesco Banconi (community) | code* | Approve | |
Review via email: mp+111212@code.launchpad.net |
Commit message
Better handling of punctuation in full text search strings
Description of the change
This branch fixes many of the issues described in bug 29273:
bug search fails to find results when punctuation is adjacent to
regular text in the document (e.g. '"from"', '<div>')
The two remaining issues are described in bug 1015511 and bug 1015519.
My main "guide line" for this branch was that any text fragment copied
from a full-text-index text and used as a search string should return
the text it was copied from. (Obvious constraints: The words should
not be stop words, and only whole words should be copied.)
We have two DB procedures ftq() and _ftq() which are used to build
full text search queries similar to:
SELECT ... FROM bugtaskflat
WHERE bugtaskflat.fti @@ ftq('search text provided by LP user');
ftq() prepares the string given by a user so that it can be passed to
the procedure to_tsquery(). _ftq() is a debugging variant: While
ftq() returns the result from calling to_tsquery(
it returns just processed_
The problems described in bugs 1015511 and 1015519 aside, the main
issues were
(1) an "overly eager" replacement of punctuation characters with "-"
(2) a replacement like
aaa-bbb -> (aaabbb | (aaa & bbb))
1. Hyphenation handling: The old code
# Convert foo-bar to ((foo&bar)|foobar) and foo-bar-baz to
# ((foo&bar&
def hyphen_repl(match):
bits = match.group(
return "((%s)|%s)" % ("&".join(bits), "".join(bits))
query = re.sub(
## plpy.debug('4 query is %s' % repr(query))
# Any remaining - characters are spurious
query = query.replace(
was outdated: This converts the string 'foo-bar' into
the search expression
'foobar|
But the FTI data stored by Postgres for the string 'foo-bar' is
select to_tsvector(
---
'bar':3 'foo':2 'foo-bar':1
Applying to_tsquery(
return a match, but other manipulations by ftq() (now also
changed/removed) lead to "search failures" for many typical filenames,
see below.
Moreover, ftq() does not need to decompose 'foo-bar' into 'foo' and 'bar'
because to_tsquery() does this itself, and in a way that matches the
data produced by to_tsvector() better:
select to_tsquery(
---
'foo-bar' & 'foo' & 'bar'
Finally, the old hypen handling breaks the search for filenames containing
hypens. I added a test for this case.
So I simply removed the code above.
2. Much punctuation was replaced with a '-'. This leads, combined with the
issue described above, to the problem from the bug report that full text
searches for file names fail.
Example:
select _ftq('file-
---
((
select ftq('file-
---
'file' & 'name' & 'py' | 'filenamepi'
while the FTI data looks like
select to_tsvector(
to_tsvector
---
'file-
So, the FTI stores just the plain filename, nothing else, while the query
data asks to look for a few sightly different terms.
On the other hand, to_tsquery() handles file names just fine:
select to_tsquery(
to_tsquery
---
'file-name.py'
The following part of the current implementation of ftq() replaces
a number of characters qith a '.':
punctuation = r"[^\w\
query = re.sub(
query = re.sub(r"(?u)%s+" % (punctuation,), " ", query)
This means that the symbols "#$%*+,
other Unicode symbols: anything that is not a word or white space or
contained in '-&|!()') are replaced with a '-' if they "connect" two
words, and that they are replaced with a ' ' otherwise. A comparison with
the FTI data:
select to_tsvector(
to_tsvector
---
'bar':2 'foo':1
select to_tsquery(
to_tsquery
---------------
'foo' & 'bar'
select ftq('foo"bar');
ftq
---
'foo-bar' & 'foo' & 'bar'
(the last output comes from ftq() having the changes described under (1).)
So, passing the unchanged search string to to_tsquery() will create
a query object that will match against the FTI data, while the search
string mangled by ftq() will not find the original string: 'foo-bar'
is not part of the FTI data.
I think the main lesson from the example above is that we should let
to_tsquery() handle punctuation as much as possible, because it treats
punctuation almost identical to to_tsvector(). Two execptions remain:
'\' and ':'. The new tests beginning in line 263 of
lib/lp/
is now handled correctly.
Bug 33920 had a special section in the doc test ("It was noticed though
in Bug #33920..."). As the new test part "Punctuation is handled
consistently..." shows, this is no longer an issue, so I removed the tests
about bug 33920.
Same for bug 39828 ("It was also noticed through Bug #39828" in the original
doc test).
I found the test "Bug #44913 - Unicode characters in the wrong place"
a bit confusing: "What happened to the leading 'a-'"? The answer is
simple: 'a' is a stop word. But I think this distracts from the main
purpose of the test, so I replaced 'a-' with 'abc-'. This results in the
usual query produced by ftq() and to_tsquery() for words containing a '-'.
Miscellaneous
-------------
The stored procedures ftq() and _ftq() were nearly identical: _ftq()
just processes a query string and returns the processed query -- the
procedure is used only for tests/debugging; ftq() did the same processing
but returns the result of
SELECT to_tsquery(
I changed the latter procedure to just call
SELECT to_tsquery(
This make the code a bit shorter and avoids the risk that the implementation
of ftq() and _ftq() diverges.
The doc test defines a function ftq() which returns the results of calls
to the stored procedures ftq() and _ftq(). This function helps to understand
how these procedures process a query string, but the test function does
not help to check if a given query will match the FTI data of a given text.
Some of the "bad queries" mentioned in bug 29713 can only be understood
by looking at both the Postgres query object for a given search term and
at the FTI data stored for the same term. So I added two test helpers
search(full_text, query_text) and search_same(text) which show the FTI
data for full_text, the Postgres query object for query_text and the result
of searching query_text in full_text.
I changed some of the existing tests to use search() instead of ftq()
because I think that the former function shows a bit better that the
search term can be properly used. For example, the old test for the query
u'a-a\N{RIGHT DOUBLE QUOTATION MARK}' showed that the quotation mark
was removed from the query. This seems to be useful -- but the new test
using search_same() shows that the FTI data for this string contains the
quotation mark, hence the FTI query object should keep it too.
(The fact that the quotation mark is treated by to_tsvector() as being
part of a word is of course a bug, but that is outside the the scope
of this branch. My goal for now is to clean up ftq(). A proper fix would
be to tweak the parser used by Postgres in to_tsquery() and in
to_tsvector().)
LOC counting
------------
The diff between the recent version of the branch and revision 15402
(the version I branched from trunk) shows 302 added lines and 116 removed
lines.
But I'd claim that this is a temporary increase of lines. My first commit
(r15403) just adds a DB patch file which contains a plain copy of the
existing procedures ftq() and _ftq(). Running
bzr diff -r 15403 | grep ^+ | wc
bzr diff -r 15403 | grep ^- | wc
shows 194 added lines and 254 removed lines. I'd claim that this count
is more appropriate than the count of added/removed lines against trunk
because the the DB patch file will finally disappear, when a new
"main schema file" (database/
generated. And the diff against r15403 for the file
database/
changes we can expect betwwen the current
database/
database/
test: ./bin/test services -vvt textsearching.txt
= Launchpad lint =
Checking for conflicts and issues in changed files.
Linting changed files:
database/
lib/lp/
./lib/lp/
697: want exceeds 78 characters.
That's not caused by my changes.
This branch looks great Abel! Also thanks for the detailed description.
I guess pointing to bug 29273 (at the beginning of the description) is a typo.