Merge lp:~cyphermox/ubuntu/oneiric/xapian-core/lp833172 into lp:ubuntu/oneiric/xapian-core
Proposed by Mathieu Trudel-Lapierre
Status: | Merged |
---|---|
Merged at revision: | 21 |
Proposed branch: | lp:~cyphermox/ubuntu/oneiric/xapian-core/lp833172 |
Merge into: | lp:ubuntu/oneiric/xapian-core |
Diff against target: | 726 lines (+689/-2), 4 files modified: debian/changelog (+7/-0), debian/control (+2/-1), debian/control.in (+2/-1), debian/patches/cjk-ngram-applied-to-1.2-branch.patch (+678/-0) |
To merge this branch: | bzr merge lp:~cyphermox/ubuntu/oneiric/xapian-core/lp833172 |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Ubuntu Sponsors | Pending | ||
Commit message
Description of the change
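No description was entered for this proposal. As a rough illustration of what the diff below adds: the changelog entry and the new test expectations indicate that a run of CJK characters is split into overlapping unigrams and bigrams (NGRAM_SIZE is 2 in cjk-tokenizer.cc). The following is a minimal standalone C++ sketch of that token sequence, not the patch's code. The names sketch_split_utf8 and sketch_cjk_ngrams are hypothetical, the input is assumed to be well-formed UTF-8 containing only CJK characters, and Xapian itself is not used.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split well-formed UTF-8 into one string per code point.
// The sequence length is taken from the lead byte; malformed input is not handled.
std::vector<std::string> sketch_split_utf8(const std::string &s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = c < 0x80 ? 1
                        : (c >> 5) == 0x6 ? 2
                        : (c >> 4) == 0xE ? 3
                        : 4;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}

// Hypothetical helper: for each character position, emit the 1-gram, then the
// 2-gram, and so on up to n characters, stopping early at the end of the run.
std::vector<std::string> sketch_cjk_ngrams(const std::string &run,
                                           std::size_t n = 2) {
    assert(n >= 1);
    const std::vector<std::string> chars = sketch_split_utf8(run);
    std::vector<std::string> tokens;
    for (std::size_t i = 0; i < chars.size(); ++i) {
        std::string token;
        for (std::size_t k = 0; k < n && i + k < chars.size(); ++k) {
            token += chars[i + k];
            tokens.push_back(token);
        }
    }
    return tokens;
}
```

Calling sketch_cjk_ngrams("久有归天") yields 久, 久有, 有, 有归, 归, 归天, 天, which corresponds to the terms expected by the first CJK case added to tests/termgentest.cc. In the patch itself this behaviour is only active when the environment variable XAPIAN_CJK_NGRAM is set to a non-empty value.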
Preview Diff
1 | === modified file 'debian/changelog' | |||
2 | --- debian/changelog 2011-04-06 02:19:10 +0000 | |||
3 | +++ debian/changelog 2011-08-25 03:14:24 +0000 | |||
4 | @@ -1,3 +1,10 @@ | |||
5 | 1 | xapian-core (1.2.5-1ubuntu1) UNRELEASED; urgency=low | ||
6 | 2 | |||
7 | 3 | * debian/patches/cjk-ngram-applied-to-1.2-branch.patch: add support for CJK | ||
8 | 4 | input methods by adding a tokenizer for CJK. (LP: #833172) | ||
9 | 5 | |||
10 | 6 | -- Mathieu Trudel-Lapierre <mathieu-tl@ubuntu.com> Wed, 24 Aug 2011 19:29:01 -0400 | ||
11 | 7 | |||
12 | 1 | xapian-core (1.2.5-1) unstable; urgency=low | 8 | xapian-core (1.2.5-1) unstable; urgency=low |
13 | 2 | 9 | ||
14 | 3 | * New upstream release. | 10 | * New upstream release. |
15 | 4 | 11 | ||
16 | === modified file 'debian/control' | |||
17 | --- debian/control 2010-08-24 11:18:50 +0000 | |||
18 | +++ debian/control 2011-08-25 03:14:24 +0000 | |||
19 | @@ -1,7 +1,8 @@ | |||
20 | 1 | Source: xapian-core | 1 | Source: xapian-core |
21 | 2 | Section: libs | 2 | Section: libs |
22 | 3 | Priority: optional | 3 | Priority: optional |
24 | 4 | Maintainer: Olly Betts <olly@survex.com> | 4 | Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com> |
25 | 5 | XSBC-Original-Maintainer: Olly Betts <olly@survex.com> | ||
26 | 5 | Standards-Version: 3.9.1 | 6 | Standards-Version: 3.9.1 |
27 | 6 | Build-Depends: debhelper (>= 7), autotools-dev, zlib1g-dev, uuid-dev | 7 | Build-Depends: debhelper (>= 7), autotools-dev, zlib1g-dev, uuid-dev |
28 | 7 | Homepage: http://xapian.org/ | 8 | Homepage: http://xapian.org/ |
29 | 8 | 9 | ||
30 | === modified file 'debian/control.in' | |||
31 | --- debian/control.in 2010-08-24 11:18:50 +0000 | |||
32 | +++ debian/control.in 2011-08-25 03:14:24 +0000 | |||
33 | @@ -1,7 +1,8 @@ | |||
34 | 1 | Source: xapian-core | 1 | Source: xapian-core |
35 | 2 | Section: libs | 2 | Section: libs |
36 | 3 | Priority: optional | 3 | Priority: optional |
38 | 4 | Maintainer: Olly Betts <olly@survex.com> | 4 | Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com> |
39 | 5 | XSBC-Original-Maintainer: Olly Betts <olly@survex.com> | ||
40 | 5 | Standards-Version: 3.9.1 | 6 | Standards-Version: 3.9.1 |
41 | 6 | Build-Depends: @BUILD_DEPS@ autotools-dev, zlib1g-dev, uuid-dev | 7 | Build-Depends: @BUILD_DEPS@ autotools-dev, zlib1g-dev, uuid-dev |
42 | 7 | Homepage: http://xapian.org/ | 8 | Homepage: http://xapian.org/ |
43 | 8 | 9 | ||
44 | === added directory 'debian/patches' | |||
45 | === added file 'debian/patches/cjk-ngram-applied-to-1.2-branch.patch' | |||
46 | --- debian/patches/cjk-ngram-applied-to-1.2-branch.patch 1970-01-01 00:00:00 +0000 | |||
47 | +++ debian/patches/cjk-ngram-applied-to-1.2-branch.patch 2011-08-25 03:14:24 +0000 | |||
48 | @@ -0,0 +1,678 @@ | |||
49 | 1 | Origin: http://trac.xapian.org/attachment/ticket/180/cjk-ngram-applied-to-1.2-branch.patch | ||
50 | 2 | Subject: Add support for CJK text to queryparser and termgenerator | ||
51 | 3 | Bug-Ubuntu: https://bugs.launchpad.net/ubuntu/+source/xapian-core/+bug/833172 | ||
52 | 4 | Bug: http://trac.xapian.org/ticket/180 | ||
53 | 5 | Last-Update: 2011-08-24 | ||
54 | 6 | |||
55 | 7 | Index: xapian-core/queryparser/Makefile.mk | ||
56 | 8 | =================================================================== | ||
57 | 9 | --- xapian-core.orig/queryparser/Makefile.mk 2011-08-24 19:09:38.000000000 -0400 | ||
58 | 10 | +++ xapian-core/queryparser/Makefile.mk 2011-08-24 19:39:30.756055473 -0400 | ||
59 | 11 | @@ -5,6 +5,7 @@ | ||
60 | 12 | endif | ||
61 | 13 | |||
62 | 14 | noinst_HEADERS +=\ | ||
63 | 15 | + queryparser/cjk-tokenizer.h\ | ||
64 | 16 | queryparser/queryparser_internal.h\ | ||
65 | 17 | queryparser/queryparser_token.h\ | ||
66 | 18 | queryparser/termgenerator_internal.h | ||
67 | 19 | @@ -57,6 +58,7 @@ | ||
68 | 20 | endif | ||
69 | 21 | |||
70 | 22 | lib_src +=\ | ||
71 | 23 | + queryparser/cjk-tokenizer.cc\ | ||
72 | 24 | queryparser/queryparser.cc\ | ||
73 | 25 | queryparser/queryparser_internal.cc\ | ||
74 | 26 | queryparser/termgenerator.cc\ | ||
75 | 27 | Index: xapian-core/queryparser/cjk-tokenizer.cc | ||
76 | 28 | =================================================================== | ||
77 | 29 | --- /dev/null 1970-01-01 00:00:00.000000000 +0000 | ||
78 | 30 | +++ xapian-core/queryparser/cjk-tokenizer.cc 2011-08-24 19:39:30.756055473 -0400 | ||
79 | 31 | @@ -0,0 +1,124 @@ | ||
80 | 32 | +/** @file cjk-tokenizer.cc | ||
81 | 33 | + * @brief Tokenise CJK text as n-grams | ||
82 | 34 | + */ | ||
83 | 35 | +/* Copyright (c) 2007, 2008 Yung-chung Lin (henearkrxern@gmail.com) | ||
84 | 36 | + * Copyright (c) 2011 Richard Boulton (richard@tartarus.org) | ||
85 | 37 | + * Copyright (c) 2011 Brandon Schaefer (brandontschaefer@gmail.com) | ||
86 | 38 | + * Copyright (c) 2011 Olly Betts | ||
87 | 39 | + * | ||
88 | 40 | + * Permission is hereby granted, free of charge, to any person obtaining a copy | ||
89 | 41 | + * of this software and associated documentation files (the "Software"), to deal | ||
90 | 42 | + * deal in the Software without restriction, including without limitation the | ||
91 | 43 | + * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or | ||
92 | 44 | + * sell copies of the Software, and to permit persons to whom the Software is | ||
93 | 45 | + * furnished to do so, subject to the following conditions: | ||
94 | 46 | + * | ||
95 | 47 | + * The above copyright notice and this permission notice shall be included in | ||
96 | 48 | + * all copies or substantial portions of the Software. | ||
97 | 49 | + * | ||
98 | 50 | + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
99 | 51 | + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
100 | 52 | + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
101 | 53 | + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
102 | 54 | + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING | ||
103 | 55 | + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS | ||
104 | 56 | + * IN THE SOFTWARE. | ||
105 | 57 | + */ | ||
106 | 58 | + | ||
107 | 59 | +#include <config.h> | ||
108 | 60 | + | ||
109 | 61 | +#include "cjk-tokenizer.h" | ||
110 | 62 | + | ||
111 | 63 | +#include "omassert.h" | ||
112 | 64 | +#include "xapian/unicode.h" | ||
113 | 65 | + | ||
114 | 66 | +#include <cstdlib> | ||
115 | 67 | +#include <string> | ||
116 | 68 | + | ||
117 | 69 | +using namespace std; | ||
118 | 70 | + | ||
119 | 71 | +static unsigned NGRAM_SIZE = 2; | ||
120 | 72 | + | ||
121 | 73 | +bool | ||
122 | 74 | +CJK::is_cjk_enabled() | ||
123 | 75 | +{ | ||
124 | 76 | + const char * p; | ||
125 | 77 | + static bool result = ((p = getenv("XAPIAN_CJK_NGRAM")) != NULL && *p); | ||
126 | 78 | + return result; | ||
127 | 79 | +} | ||
128 | 80 | + | ||
129 | 81 | +// 2E80..2EFF; CJK Radicals Supplement | ||
130 | 82 | +// 3000..303F; CJK Symbols and Punctuation | ||
131 | 83 | +// 3040..309F; Hiragana | ||
132 | 84 | +// 30A0..30FF; Katakana | ||
133 | 85 | +// 3100..312F; Bopomofo | ||
134 | 86 | +// 3130..318F; Hangul Compatibility Jamo | ||
135 | 87 | +// 3190..319F; Kanbun | ||
136 | 88 | +// 31A0..31BF; Bopomofo Extended | ||
137 | 89 | +// 31C0..31EF; CJK Strokes | ||
138 | 90 | +// 31F0..31FF; Katakana Phonetic Extensions | ||
139 | 91 | +// 3200..32FF; Enclosed CJK Letters and Months | ||
140 | 92 | +// 3300..33FF; CJK Compatibility | ||
141 | 93 | +// 3400..4DBF; CJK Unified Ideographs Extension A | ||
142 | 94 | +// 4DC0..4DFF; Yijing Hexagram Symbols | ||
143 | 95 | +// 4E00..9FFF; CJK Unified Ideographs | ||
144 | 96 | +// A700..A71F; Modifier Tone Letters | ||
145 | 97 | +// AC00..D7AF; Hangul Syllables | ||
146 | 98 | +// F900..FAFF; CJK Compatibility Ideographs | ||
147 | 99 | +// FE30..FE4F; CJK Compatibility Forms | ||
148 | 100 | +// FF00..FFEF; Halfwidth and Fullwidth Forms | ||
149 | 101 | +// 20000..2A6DF; CJK Unified Ideographs Extension B | ||
150 | 102 | +// 2F800..2FA1F; CJK Compatibility Ideographs Supplement | ||
151 | 103 | +bool | ||
152 | 104 | +CJK::codepoint_is_cjk(unsigned p) | ||
153 | 105 | +{ | ||
154 | 106 | + if (p < 0x2E80) return false; | ||
155 | 107 | + return ((p >= 0x2E80 && p <= 0x2EFF) || | ||
156 | 108 | + (p >= 0x3000 && p <= 0x9FFF) || | ||
157 | 109 | + (p >= 0xA700 && p <= 0xA71F) || | ||
158 | 110 | + (p >= 0xAC00 && p <= 0xD7AF) || | ||
159 | 111 | + (p >= 0xF900 && p <= 0xFAFF) || | ||
160 | 112 | + (p >= 0xFE30 && p <= 0xFE4F) || | ||
161 | 113 | + (p >= 0xFF00 && p <= 0xFFEF) || | ||
162 | 114 | + (p >= 0x20000 && p <= 0x2A6DF) || | ||
163 | 115 | + (p >= 0x2F800 && p <= 0x2FA1F)); | ||
164 | 116 | +} | ||
165 | 117 | + | ||
166 | 118 | +string | ||
167 | 119 | +CJK::get_cjk(Xapian::Utf8Iterator &it) | ||
168 | 120 | +{ | ||
169 | 121 | + string str; | ||
170 | 122 | + while (it != Xapian::Utf8Iterator() && codepoint_is_cjk(*it)) { | ||
171 | 123 | + Xapian::Unicode::append_utf8(str, *it); | ||
172 | 124 | + ++it; | ||
173 | 125 | + } | ||
174 | 126 | + return str; | ||
175 | 127 | +} | ||
176 | 128 | + | ||
177 | 129 | +const string & | ||
178 | 130 | +CJKTokenIterator::operator*() const | ||
179 | 131 | +{ | ||
180 | 132 | + if (current_token.empty()) { | ||
181 | 133 | + Assert(it != Xapian::Utf8Iterator()); | ||
182 | 134 | + p = it; | ||
183 | 135 | + Xapian::Unicode::append_utf8(current_token, *p); | ||
184 | 136 | + ++p; | ||
185 | 137 | + len = 1; | ||
186 | 138 | + } | ||
187 | 139 | + return current_token; | ||
188 | 140 | +} | ||
189 | 141 | + | ||
190 | 142 | +CJKTokenIterator & | ||
191 | 143 | +CJKTokenIterator::operator++() | ||
192 | 144 | +{ | ||
193 | 145 | + if (len < NGRAM_SIZE && p != Xapian::Utf8Iterator()) { | ||
194 | 146 | + Xapian::Unicode::append_utf8(current_token, *p); | ||
195 | 147 | + ++p; | ||
196 | 148 | + ++len; | ||
197 | 149 | + } else { | ||
198 | 150 | + Assert(it != Xapian::Utf8Iterator()); | ||
199 | 151 | + ++it; | ||
200 | 152 | + current_token.resize(0); | ||
201 | 153 | + } | ||
202 | 154 | + return *this; | ||
203 | 155 | +} | ||
204 | 156 | Index: xapian-core/queryparser/queryparser.lemony | ||
205 | 157 | =================================================================== | ||
206 | 158 | --- xapian-core.orig/queryparser/queryparser.lemony 2011-08-24 19:09:38.000000000 -0400 | ||
207 | 159 | +++ xapian-core/queryparser/queryparser.lemony 2011-08-24 19:39:30.756055473 -0400 | ||
208 | 160 | @@ -31,6 +31,8 @@ | ||
209 | 161 | // Include the list of token values lemon generates. | ||
210 | 162 | #include "queryparser_token.h" | ||
211 | 163 | |||
212 | 164 | +#include "cjk-tokenizer.h" | ||
213 | 165 | + | ||
214 | 166 | #include <algorithm> | ||
215 | 167 | #include <list> | ||
216 | 168 | #include <string> | ||
217 | 169 | @@ -133,6 +135,8 @@ | ||
218 | 170 | } | ||
219 | 171 | }; | ||
220 | 172 | |||
221 | 173 | +class Terms; | ||
222 | 174 | + | ||
223 | 175 | /** Class used to pass information about a token from lexer to parser. | ||
224 | 176 | * | ||
225 | 177 | * Generally an instance of this class carries term information, but it can be | ||
226 | 178 | @@ -189,6 +193,12 @@ | ||
227 | 179 | */ | ||
228 | 180 | Query * as_partial_query(State * state_) const; | ||
229 | 181 | |||
230 | 182 | + /** Build a query for a string of CJK characters. */ | ||
231 | 183 | + Query * as_cjk_query() const; | ||
232 | 184 | + | ||
233 | 185 | + /** Handle a CJK character string in a positional context. */ | ||
234 | 186 | + void as_positional_cjk_term(Terms * terms) const; | ||
235 | 187 | + | ||
236 | 188 | /// Value range query. | ||
237 | 189 | Query as_value_range_query() const; | ||
238 | 190 | |||
239 | 191 | @@ -413,6 +423,24 @@ | ||
240 | 192 | return q; | ||
241 | 193 | } | ||
242 | 194 | |||
243 | 195 | +Query * | ||
244 | 196 | +Term::as_cjk_query() const | ||
245 | 197 | +{ | ||
246 | 198 | + vector<Query> prefix_cjk; | ||
247 | 199 | + const list<string> & prefixes = prefix_info->prefixes; | ||
248 | 200 | + list<string>::const_iterator piter; | ||
249 | 201 | + for (CJKTokenIterator tk(name); tk != CJKTokenIterator(); ++tk) { | ||
250 | 202 | + for (piter = prefixes.begin(); piter != prefixes.end(); ++piter) { | ||
251 | 203 | + string cjk = *piter; | ||
252 | 204 | + cjk += *tk; | ||
253 | 205 | + prefix_cjk.push_back(Query(cjk, 1, pos)); | ||
254 | 206 | + } | ||
255 | 207 | + } | ||
256 | 208 | + Query * q = new Query(Query::OP_AND, prefix_cjk.begin(), prefix_cjk.end()); | ||
257 | 209 | + delete this; | ||
258 | 210 | + return q; | ||
259 | 211 | +} | ||
260 | 212 | + | ||
261 | 213 | Query | ||
262 | 214 | Term::as_value_range_query() const | ||
263 | 215 | { | ||
264 | 216 | @@ -520,6 +548,7 @@ | ||
265 | 217 | |||
266 | 218 | string | ||
267 | 219 | QueryParser::Internal::parse_term(Utf8Iterator &it, const Utf8Iterator &end, | ||
268 | 220 | + bool cjk_ngram, bool & is_cjk_term, | ||
269 | 221 | bool &was_acronym) | ||
270 | 222 | { | ||
271 | 223 | string term; | ||
272 | 224 | @@ -545,10 +574,16 @@ | ||
273 | 225 | } | ||
274 | 226 | was_acronym = !term.empty(); | ||
275 | 227 | |||
276 | 228 | + if (cjk_ngram && term.empty() && CJK::codepoint_is_cjk(*it)) { | ||
277 | 229 | + term = CJK::get_cjk(it); | ||
278 | 230 | + is_cjk_term = true; | ||
279 | 231 | + } | ||
280 | 232 | + | ||
281 | 233 | if (term.empty()) { | ||
282 | 234 | unsigned prevch = *it; | ||
283 | 235 | Unicode::append_utf8(term, prevch); | ||
284 | 236 | while (++it != end) { | ||
285 | 237 | + if (cjk_ngram && CJK::codepoint_is_cjk(*it)) break; | ||
286 | 238 | unsigned ch = *it; | ||
287 | 239 | if (!is_wordchar(ch)) { | ||
288 | 240 | // Treat a single embedded '&' or "'" or similar as a word | ||
289 | 241 | @@ -617,6 +652,8 @@ | ||
290 | 242 | QueryParser::Internal::parse_query(const string &qs, unsigned flags, | ||
291 | 243 | const string &default_prefix) | ||
292 | 244 | { | ||
293 | 245 | + bool cjk_ngram = CJK::is_cjk_enabled(); | ||
294 | 246 | + | ||
295 | 247 | // Set value_ranges if we may have to handle value ranges in the query. | ||
296 | 248 | bool value_ranges; | ||
297 | 249 | value_ranges = !valrangeprocs.empty() && (qs.find("..") != string::npos); | ||
298 | 250 | @@ -958,7 +995,8 @@ | ||
299 | 251 | |||
300 | 252 | phrased_term: | ||
301 | 253 | bool was_acronym; | ||
302 | 254 | - string term = parse_term(it, end, was_acronym); | ||
303 | 255 | + bool is_cjk_term = false; | ||
304 | 256 | + string term = parse_term(it, end, cjk_ngram, is_cjk_term, was_acronym); | ||
305 | 257 | |||
306 | 258 | // Boolean operators. | ||
307 | 259 | if ((mode == DEFAULT || mode == IN_GROUP || mode == IN_GROUP2) && | ||
308 | 260 | @@ -1058,6 +1096,12 @@ | ||
309 | 261 | Term * term_obj = new Term(&state, term, prefix_info, | ||
310 | 262 | unstemmed_term, stem_term, term_pos++); | ||
311 | 263 | |||
312 | 264 | + if (is_cjk_term) { | ||
313 | 265 | + Parse(pParser, CJKTERM, term_obj, &state); | ||
314 | 266 | + if (it == end) break; | ||
315 | 267 | + continue; | ||
316 | 268 | + } | ||
317 | 269 | + | ||
318 | 270 | if (mode == DEFAULT || mode == IN_GROUP || mode == IN_GROUP2) { | ||
319 | 271 | if (it != end) { | ||
320 | 272 | if ((flags & FLAG_WILDCARD) && *it == '*') { | ||
321 | 273 | @@ -1526,6 +1570,23 @@ | ||
322 | 274 | } | ||
323 | 275 | }; | ||
324 | 276 | |||
325 | 277 | +void | ||
326 | 278 | +Term::as_positional_cjk_term(Terms * terms) const | ||
327 | 279 | +{ | ||
328 | 280 | + // Add each individual CJK character to the phrase. | ||
329 | 281 | + string t; | ||
330 | 282 | + for (Utf8Iterator it(name); it != Utf8Iterator(); ++it) { | ||
331 | 283 | + Unicode::append_utf8(t, *it); | ||
332 | 284 | + Term * c = new Term(state, t, prefix_info, unstemmed, stem, pos); | ||
333 | 285 | + terms->add_positional_term(c); | ||
334 | 286 | + t.resize(0); | ||
335 | 287 | + } | ||
336 | 288 | + | ||
337 | 289 | + // FIXME: we want to add the n-grams as filters too for efficiency. | ||
338 | 290 | + | ||
339 | 291 | + delete this; | ||
340 | 292 | +} | ||
341 | 293 | + | ||
342 | 294 | // Helper macro for converting a boolean operation into a Xapian::Query. | ||
343 | 295 | #define BOOL_OP_TO_QUERY(E, A, OP, B, OP_TXT) \ | ||
344 | 296 | do {\ | ||
345 | 297 | @@ -1909,6 +1970,10 @@ | ||
346 | 298 | delete U; | ||
347 | 299 | } | ||
348 | 300 | |||
349 | 301 | +compound_term(T) ::= CJKTERM(U). { | ||
350 | 302 | + { T = U->as_cjk_query(); } | ||
351 | 303 | +} | ||
352 | 304 | + | ||
353 | 305 | // phrase - The "inside the quotes" part of a double-quoted phrase. | ||
354 | 306 | |||
355 | 307 | %type phrase {Terms *} | ||
356 | 308 | @@ -1920,11 +1985,21 @@ | ||
357 | 309 | P->add_positional_term(T); | ||
358 | 310 | } | ||
359 | 311 | |||
360 | 312 | +phrase(P) ::= CJKTERM(T). { | ||
361 | 313 | + P = new Terms; | ||
362 | 314 | + T->as_positional_cjk_term(P); | ||
363 | 315 | +} | ||
364 | 316 | + | ||
365 | 317 | phrase(P) ::= phrase(Q) TERM(T). { | ||
366 | 318 | P = Q; | ||
367 | 319 | P->add_positional_term(T); | ||
368 | 320 | } | ||
369 | 321 | |||
370 | 322 | +phrase(P) ::= phrase(Q) CJKTERM(T). { | ||
371 | 323 | + P = Q; | ||
372 | 324 | + T->as_positional_cjk_term(P); | ||
373 | 325 | +} | ||
374 | 326 | + | ||
375 | 327 | // phrased_term - A phrased term works like a single term, but is actually | ||
376 | 328 | // 2 or more terms linked together into a phrase by punctuation. There must be | ||
377 | 329 | // at least 2 terms in order to be able to have punctuation between the terms! | ||
378 | 330 | Index: xapian-core/queryparser/queryparser_internal.h | ||
379 | 331 | =================================================================== | ||
380 | 332 | --- xapian-core.orig/queryparser/queryparser_internal.h 2011-08-24 19:09:38.000000000 -0400 | ||
381 | 333 | +++ xapian-core/queryparser/queryparser_internal.h 2011-08-24 19:40:08.916055546 -0400 | ||
382 | 334 | @@ -1,7 +1,7 @@ | ||
383 | 335 | /* queryparser_internal.h: The non-lemon-generated parts of the QueryParser | ||
384 | 336 | * class. | ||
385 | 337 | * | ||
386 | 338 | - * Copyright (C) 2005,2006,2007,2010 Olly Betts | ||
387 | 339 | + * Copyright (C) 2005,2006,2007,2010,2011 Olly Betts | ||
388 | 340 | * | ||
389 | 341 | * This program is free software; you can redistribute it and/or | ||
390 | 342 | * modify it under the terms of the GNU General Public License as | ||
391 | 343 | @@ -80,6 +80,7 @@ | ||
392 | 344 | filter_type type); | ||
393 | 345 | |||
394 | 346 | std::string parse_term(Utf8Iterator &it, const Utf8Iterator &end, | ||
395 | 347 | + bool cjk_ngram, bool &is_cjk_term, | ||
396 | 348 | bool &was_acronym); | ||
397 | 349 | |||
398 | 350 | public: | ||
399 | 351 | Index: xapian-core/queryparser/cjk-tokenizer.h | ||
400 | 352 | =================================================================== | ||
401 | 353 | --- /dev/null 1970-01-01 00:00:00.000000000 +0000 | ||
402 | 354 | +++ xapian-core/queryparser/cjk-tokenizer.h 2011-08-24 19:39:30.756055473 -0400 | ||
403 | 355 | @@ -0,0 +1,94 @@ | ||
404 | 356 | +/** @file cjk-tokenizer.h | ||
405 | 357 | + * @brief Tokenise CJK text as n-grams | ||
406 | 358 | + */ | ||
407 | 359 | +/* Copyright (c) 2007, 2008 Yung-chung Lin (henearkrxern@gmail.com) | ||
408 | 360 | + * Copyright (c) 2011 Richard Boulton (richard@tartarus.org) | ||
409 | 361 | + * Copyright (c) 2011 Brandon Schaefer (brandontschaefer@gmail.com) | ||
410 | 362 | + * Copyright (c) 2011 Olly Betts | ||
411 | 363 | + * | ||
412 | 364 | + * Permission is hereby granted, free of charge, to any person obtaining a copy | ||
413 | 365 | + * of this software and associated documentation files (the "Software"), to deal | ||
414 | 366 | + * deal in the Software without restriction, including without limitation the | ||
415 | 367 | + * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or | ||
416 | 368 | + * sell copies of the Software, and to permit persons to whom the Software is | ||
417 | 369 | + * furnished to do so, subject to the following conditions: | ||
418 | 370 | + * | ||
419 | 371 | + * The above copyright notice and this permission notice shall be included in | ||
420 | 372 | + * all copies or substantial portions of the Software. | ||
421 | 373 | + * | ||
422 | 374 | + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
423 | 375 | + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
424 | 376 | + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
425 | 377 | + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
426 | 378 | + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING | ||
427 | 379 | + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS | ||
428 | 380 | + * IN THE SOFTWARE. | ||
429 | 381 | + */ | ||
430 | 382 | + | ||
431 | 383 | +#ifndef XAPIAN_INCLUDED_CJK_TOKENIZER_H | ||
432 | 384 | +#define XAPIAN_INCLUDED_CJK_TOKENIZER_H | ||
433 | 385 | + | ||
434 | 386 | +#include "xapian/unicode.h" | ||
435 | 387 | + | ||
436 | 388 | +#include <string> | ||
437 | 389 | + | ||
438 | 390 | +namespace CJK { | ||
439 | 391 | + | ||
440 | 392 | +/** Should we use the CJK n-gram code? | ||
441 | 393 | + * | ||
442 | 394 | + * The first time this is called it reads the environmental variable | ||
443 | 395 | + * XAPIAN_CJK_NGRAM and returns true if it is set to a non-empty value. | ||
444 | 396 | + * Subsequent calls cache and return the same value. | ||
445 | 397 | + */ | ||
446 | 398 | +bool is_cjk_enabled(); | ||
447 | 399 | + | ||
448 | 400 | +bool codepoint_is_cjk(unsigned codepoint); | ||
449 | 401 | + | ||
450 | 402 | +std::string get_cjk(Xapian::Utf8Iterator &it); | ||
451 | 403 | + | ||
452 | 404 | +} | ||
453 | 405 | + | ||
454 | 406 | +class CJKTokenIterator { | ||
455 | 407 | + Xapian::Utf8Iterator it; | ||
456 | 408 | + | ||
457 | 409 | + mutable Xapian::Utf8Iterator p; | ||
458 | 410 | + | ||
459 | 411 | + mutable unsigned len; | ||
460 | 412 | + | ||
461 | 413 | + mutable std::string current_token; | ||
462 | 414 | + | ||
463 | 415 | + public: | ||
464 | 416 | + CJKTokenIterator(const std::string & s) | ||
465 | 417 | + : it(s) { } | ||
466 | 418 | + | ||
467 | 419 | + CJKTokenIterator(const Xapian::Utf8Iterator & it_) | ||
468 | 420 | + : it(it_) { } | ||
469 | 421 | + | ||
470 | 422 | + CJKTokenIterator() | ||
471 | 423 | + : it() { } | ||
472 | 424 | + | ||
473 | 425 | + const std::string & operator*() const; | ||
474 | 426 | + | ||
475 | 427 | + CJKTokenIterator & operator++(); | ||
476 | 428 | + | ||
477 | 429 | + /// Get the length of the current token in Unicode characters. | ||
478 | 430 | + unsigned get_length() const { return len; } | ||
479 | 431 | + | ||
480 | 432 | + friend bool operator==(const CJKTokenIterator &, const CJKTokenIterator &); | ||
481 | 433 | +}; | ||
482 | 434 | + | ||
483 | 435 | +inline bool | ||
484 | 436 | +operator==(const CJKTokenIterator & a, const CJKTokenIterator & b) | ||
485 | 437 | +{ | ||
486 | 438 | + // We only really care about comparisons where one or other is an end | ||
487 | 439 | + // iterator. | ||
488 | 440 | + return a.it == b.it; | ||
489 | 441 | +} | ||
490 | 442 | + | ||
491 | 443 | +inline bool | ||
492 | 444 | +operator!=(const CJKTokenIterator & a, const CJKTokenIterator & b) | ||
493 | 445 | +{ | ||
494 | 446 | + return !(a == b); | ||
495 | 447 | +} | ||
496 | 448 | + | ||
497 | 449 | +#endif // XAPIAN_INCLUDED_CJK_TOKENIZER_H | ||
498 | 450 | Index: xapian-core/queryparser/termgenerator_internal.cc | ||
499 | 451 | =================================================================== | ||
500 | 452 | --- xapian-core.orig/queryparser/termgenerator_internal.cc 2011-08-24 19:09:38.000000000 -0400 | ||
501 | 453 | +++ xapian-core/queryparser/termgenerator_internal.cc 2011-08-24 19:39:30.766055473 -0400 | ||
502 | 454 | @@ -1,7 +1,7 @@ | ||
503 | 455 | /** @file termgenerator_internal.cc | ||
504 | 456 | * @brief TermGenerator class internals | ||
505 | 457 | */ | ||
506 | 458 | -/* Copyright (C) 2007,2010 Olly Betts | ||
507 | 459 | +/* Copyright (C) 2007,2010,2011 Olly Betts | ||
508 | 460 | * | ||
509 | 461 | * This program is free software; you can redistribute it and/or modify | ||
510 | 462 | * it under the terms of the GNU General Public License as published by | ||
511 | 463 | @@ -30,6 +30,8 @@ | ||
512 | 464 | |||
513 | 465 | #include <string> | ||
514 | 466 | |||
515 | 467 | +#include "cjk-tokenizer.h" | ||
516 | 468 | + | ||
517 | 469 | using namespace std; | ||
518 | 470 | |||
519 | 471 | namespace Xapian { | ||
520 | 472 | @@ -126,6 +128,8 @@ | ||
521 | 473 | TermGenerator::Internal::index_text(Utf8Iterator itor, termcount weight, | ||
522 | 474 | const string & prefix, bool with_positions) | ||
523 | 475 | { | ||
524 | 476 | + bool cjk_ngram = CJK::is_cjk_enabled(); | ||
525 | 477 | + | ||
526 | 478 | int stop_mode = STOPWORDS_INDEX_UNSTEMMED_ONLY; | ||
527 | 479 | |||
528 | 480 | if (!stopper) stop_mode = STOPWORDS_NONE; | ||
529 | 481 | @@ -163,11 +167,53 @@ | ||
530 | 482 | } | ||
531 | 483 | |||
532 | 484 | while (true) { | ||
533 | 485 | + if (cjk_ngram && CJK::codepoint_is_cjk(*itor)) { | ||
534 | 486 | + const string & cjk = CJK::get_cjk(itor); | ||
535 | 487 | + for (CJKTokenIterator tk(cjk); tk != CJKTokenIterator(); ++tk) { | ||
536 | 488 | + const string & cjk_token = *tk; | ||
537 | 489 | + if (cjk_token.size() > MAX_PROB_TERM_LENGTH) continue; | ||
538 | 490 | + | ||
539 | 491 | + if (stop_mode == STOPWORDS_IGNORE && (*stopper)(cjk_token)) | ||
540 | 492 | + continue; | ||
541 | 493 | + | ||
542 | 494 | + if (with_positions && tk.get_length() == 1) { | ||
543 | 495 | + doc.add_posting(prefix + cjk_token, ++termpos, wdf_inc); | ||
544 | 496 | + } else { | ||
545 | 497 | + doc.add_term(prefix + cjk_token, wdf_inc); | ||
546 | 498 | + } | ||
547 | 499 | + if ((flags & FLAG_SPELLING) && prefix.empty()) | ||
548 | 500 | + db.add_spelling(cjk_token); | ||
549 | 501 | + | ||
550 | 502 | + if (!stemmer.internal.get()) continue; | ||
551 | 503 | + | ||
552 | 504 | + if (stop_mode == STOPWORDS_INDEX_UNSTEMMED_ONLY && | ||
553 | 505 | + (*stopper)(cjk_token)) | ||
554 | 506 | + continue; | ||
555 | 507 | + | ||
556 | 508 | + // Note, this uses the lowercased term, but that's OK as we | ||
557 | 509 | + // only want to avoid stemming terms starting with a digit. | ||
558 | 510 | + if (!should_stem(cjk_token)) continue; | ||
559 | 511 | + | ||
560 | 512 | + // Add stemmed form without positional information. | ||
561 | 513 | + string stem("Z"); | ||
562 | 514 | + stem += prefix; | ||
563 | 515 | + stem += stemmer(cjk_token); | ||
564 | 516 | + doc.add_term(stem, wdf_inc); | ||
565 | 517 | + } | ||
566 | 518 | + while (true) { | ||
567 | 519 | + if (itor == Utf8Iterator()) return; | ||
568 | 520 | + ch = check_wordchar(*itor); | ||
569 | 521 | + if (ch) break; | ||
570 | 522 | + ++itor; | ||
571 | 523 | + } | ||
572 | 524 | + } | ||
573 | 525 | unsigned prevch; | ||
574 | 526 | do { | ||
575 | 527 | Unicode::append_utf8(term, ch); | ||
576 | 528 | prevch = ch; | ||
577 | 529 | - if (++itor == Utf8Iterator()) goto endofterm; | ||
578 | 530 | + if (++itor == Utf8Iterator() || | ||
579 | 531 | + (cjk_ngram && CJK::codepoint_is_cjk(*itor))) | ||
580 | 532 | + goto endofterm; | ||
581 | 533 | ch = check_wordchar(*itor); | ||
582 | 534 | } while (ch); | ||
583 | 535 | |||
584 | 536 | Index: xapian-core/tests/termgentest.cc | ||
585 | 537 | =================================================================== | ||
586 | 538 | --- xapian-core.orig/tests/termgentest.cc 2011-08-24 19:09:38.000000000 -0400 | ||
587 | 539 | +++ xapian-core/tests/termgentest.cc 2011-08-24 19:39:30.766055473 -0400 | ||
588 | 540 | @@ -31,6 +31,8 @@ | ||
589 | 541 | #include "testutils.h" | ||
590 | 542 | #include "utils.h" | ||
591 | 543 | |||
592 | 544 | +#include <stdlib.h> // For setenv() or putenv() | ||
593 | 545 | + | ||
594 | 546 | using namespace std; | ||
595 | 547 | |||
596 | 548 | #define TESTCASE(S) {#S, test_##S} | ||
597 | 549 | @@ -106,12 +108,26 @@ | ||
598 | 550 | "Z\xe1\x80\x9d\xe1\x80\xae\xe1\x80\x80\xe1\x80\xae\xe1\x80\x95\xe1\x80\xad\xe1\x80\x9e\xe1\x80\xaf\xe1\x80\xb6\xe1\x80\xb8\xe1\x80\x85\xe1\x80\xbd\xe1\x80\xb2\xe1\x80\x9e\xe1\x80\xb0\xe1\x80\x99\xe1\x80\xbb\xe1\x80\xac\xe1\x80\xb8\xe1\x80\x80:1 \xe1\x80\x9d\xe1\x80\xae\xe1\x80\x80\xe1\x80\xae\xe1\x80\x95\xe1\x80\xad\xe1\x80\x9e\xe1\x80\xaf\xe1\x80\xb6\xe1\x80\xb8\xe1\x80\x85\xe1\x80\xbd\xe1\x80\xb2\xe1\x80\x9e\xe1\x80\xb0\xe1\x80\x99\xe1\x80\xbb\xe1\x80\xac\xe1\x80\xb8\xe1\x80\x80[1]" }, | ||
599 | 551 | |||
600 | 552 | { "", "fish+chips", "Zchip:1 Zfish:1 chips[2] fish[1]" }, | ||
601 | 553 | + | ||
602 | 554 | + // Basic CJK tests: | ||
603 | 555 | + { "stem=", "久有归天", "久[1] 久有:1 天[4] 归[3] 归天:1 有[2] 有归:1" }, | ||
604 | 556 | + { "", "극지라", "극[1] 극지:1 라[3] 지[2] 지라:1" }, | ||
605 | 557 | + { "", "ウルス アップ", "ア[4] ウ[1] ウル:1 ス[3] ッ[5] ップ:1 プ[6] ル[2] ルス:1" }, | ||
606 | 558 | + | ||
607 | 559 | + // CJK with prefix: | ||
608 | 560 | + { "prefix=XA", "发送从", "XA从[3] XA发[1] XA发送:1 XA送[2] XA送从:1" }, | ||
609 | 561 | + { "prefix=XA", "点卡思考", "XA卡[2] XA卡思:1 XA思[3] XA思考:1 XA点[1] XA点卡:1 XA考[4]" }, | ||
610 | 562 | + | ||
611 | 563 | + // CJK mixed with non-CJK: | ||
612 | 564 | + { "prefix=", "インtestタ", "test[3] イ[1] イン:1 タ[4] ン[2]" }, | ||
613 | 565 | + { "", "配this is合a个 test!", "a[5] is[3] test[7] this[2] 个[6] 合[4] 配[1]" }, | ||
614 | 566 | + | ||
615 | 567 | // All following tests are for things which we probably don't really want to | ||
616 | 568 | // behave as they currently do, but we haven't found a sufficiently general | ||
617 | 569 | // way to implement them yet. | ||
618 | 570 | |||
619 | 571 | // Test number like things | ||
620 | 572 | - { "", "11:59", "11[1] 59[2]" }, | ||
621 | 573 | + { "stem=en", "11:59", "11[1] 59[2]" }, | ||
622 | 574 | { "", "11:59am", "11[1] 59am[2]" }, | ||
623 | 575 | |||
624 | 576 | { NULL, NULL, NULL } | ||
625 | 577 | @@ -770,6 +786,14 @@ | ||
626 | 578 | |||
627 | 579 | int main(int argc, char **argv) | ||
628 | 580 | try { | ||
629 | 581 | + // FIXME: It would be better to test with and without XAPIAN_CJK_NGRAM set. | ||
630 | 582 | +#ifdef __WIN32__ | ||
631 | 583 | + _putenv_s("XAPIAN_CJK_NGRAM", "1"); | ||
632 | 584 | +#elif defined HAVE_SETENV | ||
633 | 585 | + setenv("XAPIAN_CJK_NGRAM", "1", 1); | ||
634 | 586 | +#else | ||
635 | 587 | + putenv(const_cast<char*>("XAPIAN_CJK_NGRAM=1")); | ||
636 | 588 | +#endif | ||
637 | 589 | test_driver::parse_command_line(argc, argv); | ||
638 | 590 | return test_driver::run(tests); | ||
639 | 591 | } catch (const char * e) { | ||
640 | 592 | Index: xapian-core/tests/queryparsertest.cc | ||
641 | 593 | =================================================================== | ||
642 | 594 | --- xapian-core.orig/tests/queryparsertest.cc 2011-08-24 19:09:38.000000000 -0400 | ||
643 | 595 | +++ xapian-core/tests/queryparsertest.cc 2011-08-24 19:39:30.766055473 -0400 | ||
644 | 596 | @@ -33,6 +33,8 @@ | ||
645 | 597 | #include <string> | ||
646 | 598 | #include <vector> | ||
647 | 599 | |||
648 | 600 | +#include <stdlib.h> // For setenv() or putenv() | ||
649 | 601 | + | ||
650 | 602 | using namespace std; | ||
651 | 603 | |||
652 | 604 | #define TESTCASE(S) {#S, test_##S} | ||
653 | 605 | @@ -639,6 +641,17 @@ | ||
654 | 606 | { "multisite:xapian.org site:www.xapian.org author:richard authortitle:richard", "((ZArichard:(pos=1) OR ZArichard:(pos=2) OR ZXTrichard:(pos=2)) FILTER (Hwww.xapian.org AND (Hxapian.org OR Jxapian.org)))"}, | ||
655 | 607 | { "authortitle:richard-boulton", "((Arichard:(pos=1) PHRASE 2 Aboulton:(pos=2)) OR (XTrichard:(pos=1) PHRASE 2 XTboulton:(pos=2)))"}, | ||
656 | 608 | { "authortitle:\"richard boulton\"", "((Arichard:(pos=1) PHRASE 2 Aboulton:(pos=2)) OR (XTrichard:(pos=1) PHRASE 2 XTboulton:(pos=2)))"}, | ||
657 | 609 | + // Some CJK tests. | ||
658 | 610 | + { "久有归天愿", "(久:(pos=1) AND 久有:(pos=1) AND 有:(pos=1) AND 有归:(pos=1) AND 归:(pos=1) AND 归天:(pos=1) AND 天:(pos=1) AND 天愿:(pos=1) AND 愿:(pos=1))" }, | ||
659 | 611 | + { "title:久有 归 天愿", "((XT久:(pos=1) AND XT久有:(pos=1) AND XT有:(pos=1)) OR 归:(pos=2) OR (天:(pos=3) AND 天愿:(pos=3) AND 愿:(pos=3)))" }, | ||
660 | 612 | + { "h众ello万众", "(Zh:(pos=1) OR 众:(pos=2) OR Zello:(pos=3) OR (万:(pos=4) AND 万众:(pos=4) AND 众:(pos=4)))" }, | ||
661 | 613 | + { "世(の中)TEST_tm", "(世:(pos=1) OR (の:(pos=2) AND の中:(pos=2) AND 中:(pos=2)) OR test_tm:(pos=3))" }, | ||
662 | 614 | + { "다녀 AND 와야", "(다:(pos=1) AND 다녀:(pos=1) AND 녀:(pos=1) AND 와:(pos=2) AND 와야:(pos=2) AND 야:(pos=2))" }, | ||
663 | 615 | + { "authortitle:학술 OR 연구를", "((A학:(pos=1) AND XT학:(pos=1) AND A학술:(pos=1) AND XT학술:(pos=1) AND A술:(pos=1) AND XT술:(pos=1)) OR (연:(pos=2) AND 연구:(pos=2) AND 구:(pos=2) AND 구를:(pos=2) AND 를:(pos=2)))" }, | ||
664 | 616 | + // FIXME: These should really filter by bigrams to accelerate: | ||
665 | 617 | + { "\"久有归\"", "(久:(pos=1) PHRASE 3 有:(pos=1) PHRASE 3 归:(pos=1))" }, | ||
666 | 618 | + { "\"久有test归\"", "(久:(pos=1) PHRASE 4 有:(pos=1) PHRASE 4 test:(pos=2) PHRASE 4 归:(pos=3))" }, | ||
667 | 619 | + // FIXME: this should work: { "久 NEAR 有", "(久:(pos=1) NEAR 11 有:(pos=2))" }, | ||
668 | 620 | { NULL, NULL } | ||
669 | 621 | }; | ||
670 | 622 | |||
671 | 623 | @@ -709,6 +722,9 @@ | ||
672 | 624 | // Add coverage for other cases similar to the above. | ||
673 | 625 | { "a b site:xapian.org", "((Za:(pos=1) AND Zb:(pos=2)) FILTER Hxapian.org)" }, | ||
674 | 626 | { "site:xapian.org a b", "((Za:(pos=1) AND Zb:(pos=2)) FILTER Hxapian.org)" }, | ||
675 | 627 | + // Some CJK tests. | ||
676 | 628 | + { "author:험가 OR subject:万众 hello world!", "((A험:(pos=1) AND A험가:(pos=1) AND A가:(pos=1)) OR (XT万:(pos=2) AND XT万众:(pos=2) AND XT众:(pos=2) AND Zhello:(pos=3) AND Zworld:(pos=4)))" }, | ||
677 | 629 | + { "洛伊one儿差点two脸three", "(洛:(pos=1) AND 洛伊:(pos=1) AND 伊:(pos=1) AND Zone:(pos=2) AND 儿:(pos=3) AND 儿差:(pos=3) AND 差:(pos=3) AND 差点:(pos=3) AND 点:(pos=3) AND Ztwo:(pos=4) AND 脸:(pos=5) AND Zthree:(pos=6))" }, | ||
678 | 630 | { NULL, NULL } | ||
679 | 631 | }; | ||
680 | 632 | |||
681 | 633 | @@ -761,6 +777,8 @@ | ||
682 | 634 | TEST_STRINGS_EQUAL(qobj.get_description(), "Xapian::Query((ZAme:(pos=1) OR ZXTstuff:(pos=2)))"); | ||
683 | 635 | qobj = qp.parse_query("title:(stuff) me", Xapian::QueryParser::FLAG_BOOLEAN, "A"); | ||
684 | 636 | TEST_STRINGS_EQUAL(qobj.get_description(), "Xapian::Query((ZXTstuff:(pos=1) OR ZAme:(pos=2)))"); | ||
685 | 637 | + qobj = qp.parse_query("英国 title:文森hello", 0, "A"); | ||
686 | 638 | + TEST_STRINGS_EQUAL(qobj.get_description(), "Xapian::Query(((A英:(pos=1) AND A英国:(pos=1) AND A国:(pos=1)) OR (XT文:(pos=2) AND XT文森:(pos=2) AND XT森:(pos=2)) OR ZAhello:(pos=3)))"); | ||
687 | 639 | return true; | ||
688 | 640 | } | ||
689 | 641 | |||
690 | 642 | @@ -2441,6 +2459,14 @@ | ||
691 | 643 | |||
692 | 644 | int main(int argc, char **argv) | ||
693 | 645 | try { | ||
694 | 646 | + // FIXME: It would be better to test with and without XAPIAN_CJK_NGRAM set. | ||
695 | 647 | +#ifdef __WIN32__ | ||
696 | 648 | + _putenv_s("XAPIAN_CJK_NGRAM", "1"); | ||
697 | 649 | +#elif defined HAVE_SETENV | ||
698 | 650 | + setenv("XAPIAN_CJK_NGRAM", "1", 1); | ||
699 | 651 | +#else | ||
700 | 652 | + putenv(const_cast<char*>("XAPIAN_CJK_NGRAM=1")); | ||
701 | 653 | +#endif | ||
702 | 654 | test_driver::parse_command_line(argc, argv); | ||
703 | 655 | return test_driver::run(tests); | ||
704 | 656 | } catch (const char * e) { | ||
705 | 657 | Index: xapian-core/ChangeLog | ||
706 | 658 | =================================================================== | ||
707 | 659 | --- xapian-core.orig/ChangeLog 2011-08-24 19:09:38.000000000 -0400 | ||
708 | 660 | +++ xapian-core/ChangeLog 2011-08-24 19:42:18.056055791 -0400 | ||
709 | 661 | @@ -1,3 +1,17 @@ | ||
710 | 662 | +Wed Aug 24 14:25:21 GMT 2011 Olly Betts <olly@survex.com> | ||
711 | 663 | + | ||
712 | 664 | + * Backport change from trunk: | ||
713 | 665 | + * queryparser/queryparser.lemony: Fix memory leak (caught by existing | ||
714 | 666 | + testcase queryparser1 when run under valgrind). | ||
715 | 667 | + | ||
716 | 668 | +Wed Aug 24 14:13:24 GMT 2011 Olly Betts <olly@survex.com> | ||
717 | 669 | + | ||
718 | 670 | + * Backport change from trunk: | ||
719 | 671 | + * queryparser/,tests/queryparsertest.cc,tests/termgentest.cc: Add | ||
720 | 672 | + support for indexing and searching CJK text using n-grams. Currently | ||
721 | 673 | + this is only enabled if environmental variable XAPIAN_CJK_NGRAM is | ||
722 | 674 | + set to a non-empty value. | ||
723 | 675 | + | ||
724 | 676 | Mon Apr 04 14:41:33 GMT 2011 Olly Betts <olly@survex.com> | ||
725 | 677 | |||
726 | 678 | * NEWS: Final update for 1.2.5. |
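The termgentest.cc expectations in the patch above (for example "发送从" with prefix=XA giving "XA从[3] XA发[1] XA发送:1 XA送[2] XA送从:1") suggest that, for a run of CJK characters, each character is indexed as a unigram with a position and each adjacent pair as a bigram without one. The following is a minimal sketch of that scheme only, not the code in the patch: `split_utf8`, `Posting` and `cjk_ngrams` are hypothetical names, the input is assumed to be well-formed UTF-8 consisting purely of CJK characters, and mixed CJK/non-CJK text, stemming and script detection are not handled.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into one std::string per
// code point. Assumes well-formed UTF-8; performs no validation.
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}

// Hypothetical record of what was indexed for one term.
struct Posting {
    std::vector<unsigned> positions;  // empty for bigrams
    unsigned wdf = 0;                 // within-document frequency
};

// Emit unigrams (with 1-based positions) and bigrams (wdf only) for a
// run of CJK characters, optionally adding a term prefix such as "XA".
std::map<std::string, Posting> cjk_ngrams(const std::string& text,
                                          const std::string& prefix = "") {
    std::map<std::string, Posting> terms;
    const std::vector<std::string> chars = split_utf8(text);
    for (std::size_t i = 0; i < chars.size(); ++i) {
        Posting& uni = terms[prefix + chars[i]];
        uni.positions.push_back(static_cast<unsigned>(i + 1));
        ++uni.wdf;
        if (i + 1 < chars.size()) {
            // Bigram of this character and the next: counted, not positioned.
            ++terms[prefix + chars[i] + chars[i + 1]].wdf;
        }
    }
    return terms;
}
```

Calling `cjk_ngrams("发送从", "XA")` yields five terms, three positioned unigrams and two unpositioned bigrams, which is the shape the test table expects for that input.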