Merge lp:~jamesh/mediascanner2/fix-tokenizer into lp:mediascanner2

Proposed by James Henstridge
Status: Merged
Approved by: Jussi Pakkanen
Approved revision: 250
Merged at revision: 249
Proposed branch: lp:~jamesh/mediascanner2/fix-tokenizer
Merge into: lp:mediascanner2
Diff against target: 83 lines (+8/-17)
4 files modified
CMakeLists.txt (+1/-1)
debian/control (+1/-1)
src/mediascanner/mozilla/fts3_porter.c (+5/-7)
test/test_mediastore.cc (+1/-8)
To merge this branch: bzr merge lp:~jamesh/mediascanner2/fix-tokenizer
Reviewer Review Type Date Requested Status
Jussi Pakkanen (community) Approve
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+225129@code.launchpad.net

Commit message

Update the FTS tokenizer to work with SQLite 3.8.5, and reenable the MediaStoreTest.query_short test. Bump libsqlite3-dev build dependency to 3.8.5 too.

Description of the change

Fix the full text search tokenizer to work with SQLite 3.8.5.

With the following change, sqlite3 stopped passing the wildcard suffix to the tokenizer, which broke our hack for detecting when we were tokenizing a wild card query:

  http://www.sqlite.org/cgi/src/info/e21bf7a2ade6373e94ea403c665f78e1ad22143f

So now always tokenize short words at the end of input.

This shouldn't require a rebuild of the index, since this last case in the tokenizer shouldn't have been hit when parsing actual file metadata (probably: it would have matched input that actually ended in '*'). The new version of the tokenizer could potentially cause extra words to be added to the index, but they are probably low quality (being short) so I wouldn't bother with a rebuild.

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Jussi Pakkanen (jpakkane) wrote :

Needs to close bug 1335281 but otherwise ok.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'CMakeLists.txt'
2--- CMakeLists.txt 2014-06-06 09:06:32 +0000
3+++ CMakeLists.txt 2014-07-01 10:16:01 +0000
4@@ -12,7 +12,7 @@
5 include(FindPkgConfig)
6 pkg_check_modules(MEDIASCANNER REQUIRED
7 gio-2.0
8- sqlite3
9+ sqlite3>=3.8.5
10 )
11 pkg_check_modules(GST gstreamer-1.0 gstreamer-pbutils-1.0 REQUIRED)
12 pkg_check_modules(GLIB glib-2.0 REQUIRED)
13
14=== modified file 'debian/control'
15--- debian/control 2014-06-24 15:47:53 +0000
16+++ debian/control 2014-07-01 10:16:01 +0000
17@@ -15,7 +15,7 @@
18 libgstreamer1.0-dev,
19 libgtest-dev,
20 libproperties-cpp-dev,
21- libsqlite3-dev,
22+ libsqlite3-dev (>= 3.8.5),
23 qt5-default,
24 qtbase5-dev,
25 qtbase5-dev-tools,
26
27=== modified file 'src/mediascanner/mozilla/fts3_porter.c'
28--- src/mediascanner/mozilla/fts3_porter.c 2014-05-14 13:22:54 +0000
29+++ src/mediascanner/mozilla/fts3_porter.c 2014-07-01 10:16:01 +0000
30@@ -1080,10 +1080,9 @@
31 /* We emit a token if:
32 * - there are two ideograms together,
33 * - there are three chars or more,
34- * - we think this is a query and wildcard magic is desired.
35- * We think is a wildcard query when we have at least one
36- * character, our current offset is one shy of nInput and the
37- * character at iOffset is '*'.
38+ * - we are at the end of the input and have at least one char.
39+ * This final case is to cover wildcard prefix searches with short
40+ * prefixes.
41 */
42 // It is possible we have no token to emit here if iPrevBigramOffset was not
43 // 0 on entry and there was no second CJK character. iPrevBigramOffset
44@@ -1092,10 +1091,9 @@
45 (numChars == 2 && state == BIGRAM_USE) ||
46 // otherwise, drop two-letter words (considered stop-words)
47 (numChars >=3) ||
48- // wildcard case:
49+ // final word wildcard case:
50 (numChars >= 1 &&
51- (c->iOffset == c->nInput - 1) &&
52- (z[c->iOffset] == '*'))) {
53+ c->iOffset == c->nInput)) {
54 /* figure out the number of bytes to copy/stem */
55 int n = c->iOffset - iStartOffset;
56 /* make sure there is enough buffer space */
57
58=== modified file 'test/test_mediastore.cc'
59--- test/test_mediastore.cc 2014-06-27 14:12:38 +0000
60+++ test/test_mediastore.cc 2014-07-01 10:16:01 +0000
61@@ -316,13 +316,6 @@
62 EXPECT_EQ(result[1], audio2); // title has highest weighting
63 }
64
65-/* When SQLite updated from 3.8.2 to 3.8.5, the behavior for
66- * short searches seems to have changed and the test queries
67- * here return empty. This test is currently disabled to make
68- * the test suite pass so a new version of Mediascanner
69- * can enter the archive. Do try to re-enable this as
70- * soon as possible.
71-
72 TEST_F(MediaStoreTest, query_short) {
73 MediaFile audio1 = MediaFileBuilder("/path/foo5.ogg")
74 .setType(AudioMedia)
75@@ -346,7 +339,7 @@
76 result = store.query("xy", AudioMedia);
77 EXPECT_EQ(result.size(), 1);
78 }
79-*/
80+
81 TEST_F(MediaStoreTest, query_empty) {
82 MediaFile audio1 = MediaFileBuilder("/path/foo5.ogg")
83 .setType(AudioMedia)

Subscribers

People subscribed via source and target branches