Merge lp:~reedobrien/charmworld/ngrams into lp:charmworld

Proposed by Reed O'Brien
Status: Merged
Approved by: Reed O'Brien
Approved revision: 516
Merged at revision: 509
Proposed branch: lp:~reedobrien/charmworld/ngrams
Merge into: lp:charmworld
Diff against target: 359 lines (+184/-21)
3 files modified
Makefile (+2/-2)
charmworld/search.py (+104/-8)
charmworld/tests/test_search.py (+78/-11)
To merge this branch: bzr merge lp:~reedobrien/charmworld/ngrams
Reviewer            Review Type              Date Requested    Status
Juju Gui Bot        continuous-integration                     Approve
Richard Harding                                                Approve
j.c.sackett         community                                  Approve
Review via email: mp+218821@code.launchpad.net

Commit message

Adds an ngrams multi_field index on name. Skips a destructive test -- see Bug #1317567

Description of the change

Allow finding sugarcrm by searching for crm, or finding postgresql, cf-mysql, and mysql by searching for sql.
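To illustrate the matching, here is a minimal sketch in plain Python (not part of this branch; it skips the standard-tokenizer step that first splits a name like cf-mysql on the hyphen):

def ngrams(token, min_gram=3, max_gram=20):
    # All substrings of the lowercased token with lengths between
    # min_gram and max_gram, mimicking what an ngram token filter emits.
    token = token.lower()
    return set(token[i:i + n]
               for n in range(min_gram, min(max_gram, len(token)) + 1)
               for i in range(len(token) - n + 1))

assert 'crm' in ngrams('sugarcrm')
assert 'sql' in ngrams('postgresql')
assert 'sql' in ngrams('mysql')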

Changes:
 - Adds a multi_field index for name, with an exact (not_analyzed) field and a 3-20 character ngram analyzer (condensed sketch after this list).
 - Fixes existing tests and adds tests for the new index.
 - Updates api_search with autocomplete=True to use the ngrams field.
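
In condensed form (restated from the diff below), the analyzer settings and the new name mapping are:

analysis = {
    "filter": {
        "n3_20grams_filter": {"type": "ngram", "min_gram": 3, "max_gram": 20},
    },
    "analyzer": {
        "n3_20grams": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "n3_20grams_filter"],
        },
    },
}

# Applied to "name" on both charms and bundles: an exact sub-field plus an
# ngram-analyzed sub-field that api_search uses for autocomplete.
name_mapping = {
    "type": "multi_field",
    "fields": {
        "ngrams": {"type": "string", "analyzer": "n3_20grams"},
        "name": {"type": "string", "index": "not_analyzed"},
    },
}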

Secondary changes:
 - removes one test which deleted all ES data when run
 - updates ppa:juju/pkgs -> ppa:juju/stable
 - removes svn from sysdeps

Caveats:
I think this will also analyze the query, which may mean that searching with no minimum match set is greedier than desired. As far as I understand, though, this accomplishes what was asked for.
Relevance is subjective, so I didn't try tuning for it. The results I get look good to me at a glance. :)

Also, the minimum ngram length is 3, so there is *likely no good reason for the GUI to send requests for 1 or 2 characters in the search box*.
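
For reference, this is the autocomplete query that changes (restated from the diff below); with min_gram at 3, a 1- or 2-character query produces no grams to match:

text = 'sql'  # what the search box sends

# Before: prefix match against the exact name field.
old_query = {'prefix': {'data.name': text}}

# After: match against the ngram-analyzed sub-field, so 'sql' finds
# postgresql, cf-mysql, and mysql, and 'crm' finds sugarcrm.
new_query = {'match': {'ngrams': text}}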

Once built, ingest charms as follows for testing:

# ttyn
$ make run

# ttyn+1
$ bin/es-update
$ bin/ingest-queued --prefix=~cabs-team/ ## gets sugarcrm
$ bin/ingest-queued --prefix=~charmers/charms/ ## gets a bunch of charms
$ bin/ingest-queued --prefix=~bac ## gets a bundle
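
To spot-check the ngram matching directly against Elasticsearch afterwards, something like the following works (a sketch mirroring the new tests; the ES URL and index name are assumptions -- use the values from your charmworld ini):

from pyelasticsearch import ElasticSearch

es = ElasticSearch('http://localhost:9200/')  # assumed local ES
index_name = 'charms'  # hypothetical; use your configured index name

# 'crm' should hit sugarcrm once ~cabs-team has been ingested.
query = {"query": {"match": {"ngrams": "crm"}}}
results = es.search(query, index=index_name)
print(results['hits']['total'])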

Revision history for this message
j.c.sackett (jcsackett) wrote :

Code looks good; I'm QAing now.

Can you file a bug about the test you had to comment out?

Revision history for this message
j.c.sackett (jcsackett) wrote :

I cannot QA this successfully on my machine, but others have been able to.

review: Approve
Revision history for this message
Reed O'Brien (reedobrien) wrote :

> Code looks good, I'm qaing now.
>
> Can you file a bug about the test you had to comment out?

Bug #1317567 - http://bit.ly/1lsXCfv

Revision history for this message
Richard Harding (rharding) wrote :

QA is mixed. The results seem no worse than current, so let's try this out. I did try a search for 'couch', which I would have expected to ngram-match 'couchdb', but it did not. I also tried 'build', which found 'buildbot'. I'm not sure whether we hit on other matching criteria, though.

As for the review, let's take advantage of the ability to skip a test. I think nose will respect @unittest.skip("") rather than commenting out a large block of code.
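
A minimal sketch of that approach (the decorator is the one that ends up in the diff below; the test class name here is just a placeholder):

import unittest


class TestReindex(unittest.TestCase):  # placeholder class name

    @unittest.skip("XXX: Bug #1317567 - http://bit.ly/1lsXCfv")
    def test_reindexed_no_client_charms(self):
        # nose honors unittest's skip decorators, so the test is reported
        # as skipped instead of being commented out.
        pass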

review: Needs Fixing
Revision history for this message
Richard Harding (rharding) :
review: Approve
Revision history for this message
Juju Gui Bot (juju-gui-bot) wrote :

FAILED: Autolanding.
No commit message was specified in the merge proposal. Hit 'Add commit message' on the merge proposal web page or follow the link below. You can approve the merge proposal yourself to rerun.
https://code.launchpad.net/~reedobrien/charmworld/ngrams/+merge/218821/+edit-commit-message

review: Needs Fixing (continuous-integration)
Revision history for this message
Richard Harding (rharding) wrote :

reapprove

review: Approve
Revision history for this message
Juju Gui Bot (juju-gui-bot) :
review: Approve (continuous-integration)

Preview Diff

=== modified file 'Makefile'
--- Makefile 2014-02-10 21:27:57 +0000
+++ Makefile 2014-05-14 15:51:33 +0000
@@ -12,7 +12,7 @@
 PSERVE := bin/pserve
 SYSTEM_DEPS := \
 build-essential bzr libyaml-dev python-dev python-virtualenv\
- subversion nodejs npm
+ nodejs npm


 # Default ini file to use for all scripted commands. Can be overridden by
@@ -64,7 +64,7 @@
 sysdeps:
 sudo apt-get update
 sudo apt-get install --yes python-software-properties
- sudo add-apt-repository -y ppa:juju/pkgs
+ sudo add-apt-repository -y ppa:juju/stable
 sudo add-apt-repository -y ppa:ce-orange-squad/experimental
 sudo apt-get update
 sudo apt-get install -y $(SYSTEM_DEPS) mongodb-server \

=== modified file 'charmworld/search.py'
--- charmworld/search.py 2014-04-14 18:09:22 +0000
+++ charmworld/search.py 2014-05-14 15:51:33 +0000
@@ -32,6 +32,7 @@
 ]
 charm_free_text_fields = {
 'name': 10,
+ 'ngrams': None,
 'summary': 5,
 'description': 3,
 'config.options.description': None,
@@ -45,6 +46,7 @@
 ]
 bundle_free_text_fields = {
 'name': 10,
+ 'ngrams': None,
 'basket_name': 5,
 'description': 3,
 'title': None,
@@ -160,6 +162,62 @@
 "index": {
 "number_of_shards": settings['es_shards'],
 "number_of_replicas": settings['es_replicas'],
+ "analysis": {
+ "filter": {
+ "n3_20grams_filter": {
+ "type": "ngram",
+ "min_gram": 3,
+ "max_gram": 20
+ }
+ },
+ "analyzer": {
+ "n3_20grams": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "n3_20grams_filter"
+ ]
+ }
+ }
+ },
+ "mappings": {
+ "charm": {
+ "properties": {
+ "name": {
+ "type": "multi_field",
+ "fields": {
+ "ngrams": {
+ "type": "string",
+ "analyzer": "n3_20grams"
+ },
+ "name": {
+ "type": "string",
+ "index": "not_analyzed"
+ }
+ }
+ }
+ }
+
+ },
+ "bundle": {
+ "properties": {
+ "name": {
+ "type": "multi_field",
+ "fields": {
+ "ngrams": {
+ "type": "string",
+ "analyzer": "n3_20grams"
+ },
+ "name": {
+ "type": "string",
+ "index": "not_analyzed"
+ }
+ }
+ }
+ }
+ }
+ }
 }
 })
 # The ES server may need some time to actually create the index
@@ -221,7 +279,6 @@
 """
 charm_exact_index = [
 'categories',
- 'name',
 'owner',
 'i_provides',
 'i_requires',
@@ -229,16 +286,37 @@
 'store_url',
 '_id',
 ]
+
 inner_charm_properties = dict(
 (name, {'type': 'string', 'index': 'not_analyzed'})
 for name in charm_exact_index)

- CHARM_TEXT_INDEX = (
+ charm_n3_20gram_index = [
+ 'name'
+ ]
+
+ charm_text_index = (
 'summary',
 'description',
 )
+
 inner_charm_properties.update(dict(
- (name, {'type': 'string'}) for name in CHARM_TEXT_INDEX))
+ (name, {'type': 'string'}) for name in charm_text_index))
+
+ inner_charm_properties.update(
+ dict((name, {
+ "type": "multi_field",
+ "fields": {
+ "ngrams": {
+ "type": "string",
+ "analyzer": "n3_20grams"
+ },
+ "name": {
+ "type": "string",
+ "index": "not_analyzed"
+ }
+ }
+ }) for name in charm_n3_20gram_index))

 CHARM_INTEGER_INDEX = (
 'downloads',
@@ -290,19 +368,37 @@
 }

 bundle_exact_index = [
- 'name',
 'owner',
 'series',
 'title',
 ]

+ bundle_n3_20gram_index = [
+ "name",
+ ]
+
 bundle_properties = {
 'data': {
 'properties': # XXX The linter won't let me indent this well.
 dict((name, {'type': 'string', 'index': 'not_analyzed'})
- for name in bundle_exact_index)
+ for name in bundle_exact_index)
 }}

+ bundle_properties['data']['properties'].update(
+ dict((name, {
+ "type": "multi_field",
+ "fields": {
+ "ngrams": {
+ "type": "string",
+ "analyzer": "n3_20grams"
+ },
+ "name": {
+ "type": "string",
+ "index": "not_analyzed"
+ }
+ }
+ }) for name in bundle_n3_20gram_index))
+
 with translate_error():
 try:
 for (name, properties, dynamic) in (
@@ -407,7 +503,7 @@
 text = text[:-1]

 if autocomplete:
- return {'prefix': {'data.name': text}}
+ return {'match': {'ngrams': text}}

 charm_fields = [field + ('' if boost is None else '^%d' % boost)
 for field, boost in charm_free_text_fields.items()]
@@ -425,8 +521,8 @@
 fields.extend(other_fields)
 return {
 'query_string': {
- 'query': text,
- 'fields': fields,
+ 'query': text,
+ 'fields': fields,
 }}

 charm_dsl = {'filtered': {

=== modified file 'charmworld/tests/test_search.py'
--- charmworld/tests/test_search.py 2014-04-02 14:12:22 +0000
+++ charmworld/tests/test_search.py 2014-05-14 15:51:33 +0000
@@ -8,6 +8,8 @@
 from textwrap import dedent
 from time import sleep

+import unittest
+
 from pyelasticsearch import ElasticSearch
 from pyelasticsearch.exceptions import (
 ElasticHttpError,
@@ -154,6 +156,46 @@
 # it's present.
 self.assertItemsEqual(['data', 'weight'], results[0].keys())

+ def test_ngrams_found(self):
+ # This tests that the underlying client search works
+ # not that the charmworld search API does _client != client.
+ client = self.index_client
+ test_value = "blarglefoafblatherinfinatum"
+ Charm(self.makeCharm(name=test_value))
+ query = {"query": {"match": {"ngrams": "bla"}}}
+ start, mingram = 0, 3
+ while start < len(test_value) - 3:
+ while mingram < len(test_value):
+ query['query']['match']['ngrams'] = test_value[start:mingram]
+ results = client._client.search(query, index=client.index_name)
+ self.assertEquals(results['hits']['total'], 1)
+ mingram += 1
+ start += 1
+
+ def test_ngrams_too_short(self):
+ client = self.index_client
+ test_value = "blarg"
+ Charm(self.makeCharm(name=test_value))
+ query = {"query": {"match": {"ngrams": "bl"}}}
+ start, mingram = 0, 2
+ while start < len(test_value) - 3:
+ while mingram < len(test_value):
+ query['query']['match']['ngrams'] = test_value[start:mingram]
+ results = client._client.search(query, index=client.index_name)
+ self.assertEquals(results['hits']['total'], 0)
+ mingram += 1
+ start += 1
+
+ def test_ngrams_no_matching_ngrams(self):
+ client = self.index_client
+ test_value = "blarglefoafblatherinfinatum"
+ foo_charm = Charm(self.makeCharm(name=test_value))
+ foo_charm = foo_charm # shut it pyflakes
+ query = {"query": {"match":
+ {"ngrams": "blvlsdkjfa;lsdkghoifdhvsoidufhvsdececewv"}}}
+ results = client._client.search(query, index=client.index_name)
+ self.assertEquals(results['hits']['total'], 0)
+
 def test_search_bundle(self):
 bundle = Bundle(self.makeBundle())
 client = self.index_client
@@ -516,14 +558,26 @@
 ids = [r['data']['_id'] for r in result]
 self.assertEqual([matching_charm['_id']], ids)

- def test_autocomplete_matches_on_prefix(self):
+ def test_autocomplete_matches_on_ngram(self):
 charm1 = self.makeCharm(name='foo')
 charm2 = self.makeCharm(name='foobar')
+ charm3 = self.makeCharm(name='barfoo')
+ result = self.index_client.api_search(
+ 'foo', autocomplete=True, min_score=0)
+ ids = [r['data']['_id'] for r in result]
+ self.assertItemsEqual([
+ charm1['_id'],
+ charm2['_id'],
+ charm3['_id']],
+ ids)
+
+ def test_autocomplete_misses_ngram(self):
+ self.makeCharm(name='foo')
+ self.makeCharm(name='foobar')
 self.makeCharm(name='barfoo')
 result = self.index_client.api_search(
- 'foo', autocomplete=True, min_score=0)
- ids = [r['data']['_id'] for r in result]
- self.assertItemsEqual([charm1['_id'], charm2['_id']], ids)
+ 'bla', autocomplete=True, min_score=0)
+ self.assertEquals(result, [])

 def test_special_characters_return_no_results(self):
 self.makeCharm(name='foo')
@@ -1104,7 +1158,7 @@

 def put_incompatible_mapping(index_client):
 put_mapping(index_client, {
- 'name': {'type': 'string', 'index': 'analyzed'},
+ '_id': {'type': 'string', 'index': 'analyzed'},
 })


@@ -1131,8 +1185,12 @@
 put_incompatible_mapping(index_client)
 update(index_client)
 mapping = index_client.get_mapping()[CHARM]['properties']['data']
- self.assertEqual('not_analyzed',
- mapping['properties']['name']['index'])
+ self.assertEqual(
+ 'not_analyzed',
+ mapping['properties']['name']['fields']['name']['index'])
+ self.assertEqual(
+ 'n3_20grams',
+ mapping['properties']['name']['fields']['ngrams']['analyzer'])

 def test_update_no_index(self):
 index_client = ElasticSearchClient.from_settings(
@@ -1143,8 +1201,12 @@
 self.assertEqual([actual_client.index_name],
 index_client.get_aliased())
 mapping = index_client.get_mapping()[CHARM]['properties']['data']
- self.assertEqual('not_analyzed',
- mapping['properties']['name']['index'])
+ self.assertEqual(
+ 'not_analyzed',
+ mapping['properties']['name']['fields']['name']['index'])
+ self.assertEqual(
+ 'n3_20grams',
+ mapping['properties']['name']['fields']['ngrams']['analyzer'])

 def update_to_static_mapping(self, force_reindex):
 index_client = self.use_index_client(put_mapping=False)
@@ -1218,9 +1280,14 @@
 new_aliased = reindex(alias)
 self.assertEqual([new_aliased.index_name], alias.get_aliased())
 mapping = alias.get_mapping()[CHARM]['properties']['data']
- self.assertEqual('not_analyzed',
- mapping['properties']['name']['index'])
+ self.assertEqual(
+ 'n3_20grams',
+ mapping['properties']['name']['fields']['ngrams']['analyzer'])
+ self.assertEqual(
+ 'not_analyzed',
+ mapping['properties']['name']['fields']['name']['index'])

+ @unittest.skip("XXX: Bug #1317567 - http://bit.ly/1lsXCfv")
 def test_reindexed_no_client_charms(self):
 client = ElasticSearchClient.from_settings(get_ini())
 # This should not raise an exception, even though the index does not
