Merge lp:~reedobrien/charmworld/ngrams into lp:charmworld

Proposed by Reed O'Brien
Status: Merged
Approved by: Reed O'Brien
Approved revision: 516
Merged at revision: 509
Proposed branch: lp:~reedobrien/charmworld/ngrams
Merge into: lp:charmworld
Diff against target: 359 lines (+184/-21)
3 files modified
Makefile (+2/-2)
charmworld/search.py (+104/-8)
charmworld/tests/test_search.py (+78/-11)
To merge this branch: bzr merge lp:~reedobrien/charmworld/ngrams
Reviewer Review Type Date Requested Status
Juju Gui Bot continuous-integration Approve
Richard Harding Approve
j.c.sackett (community) Approve
Review via email: mp+218821@code.launchpad.net

Commit message

adds ngrams as multifield index to name. skips destructive test -- see Bug #1317567

Description of the change

Allow finding sugarcrm by searching for crm, or postgresql, cf-mysql, and mysql by searching for sql.

Changes:
 - Adds a multifield index for name with exact field and 3-20 ngrams analyzer.
 - Fixes and adds tests for new index.
 - Updates api_search with autocomplete=True to use the ngrams field.

Secondary changes:
 - removes one test which deleted all ES data when run
 - updates ppa:juju/pkgs -> ppa:juju/stable
 - removes svn from sysdeps

Caveats:
I think that this will analyze the query which may mean that searching with no minimum matching set may be greedier than desired. AFAIU, this accomplishes what was desired.
Relevance is subjective so I didn't try tuning for relevance. The results I get look good to me at a glance:)

Also, the minimum ngram is 3 so there *isn't likely any good reason for the gui to send requests at 1 or 2 chars in the search box*.

Once built ingest charms as follows for testing:

# ttyn
$ make run

# ttyn+1
$ bin/es-update
$ bin/ingest-queued --prefix=~cabs-team/ ## gets sugarcrm
$ bin/ingest-queued --prefix=~charmers/charms/ ## gets a bunch of charms
$ bin/ingest-queued --prefix=~bac ## gets a bundle

To post a comment you must log in.
Revision history for this message
j.c.sackett (jcsackett) wrote :

Code looks good, I'm qaing now.

Can you file a bug about the test you had to comment out?

Revision history for this message
j.c.sackett (jcsackett) wrote :

I cannot QA this successfully on my machine, but others have been able to.

review: Approve
Revision history for this message
Reed O'Brien (reedobrien) wrote :

> Code looks good, I'm qaing now.
>
> Can you file a bug about the test you had to comment out?

Bug #1317567 - http://bit.ly/1lsXCfv

Revision history for this message
Richard Harding (rharding) wrote :

QA is mixed. The results seem no worse than current so let's try this out. I did try out search for 'couch' which I would have expected to ngram match for 'couchdb' but it did not. I also tried 'build' which found 'buildbot'. I'm not sure if we hit on other matching criteria though.

As for review, let's take advantage of the ability to skip a test. I think nose will respect @unittest.skip("") rather than comment out a large block of code.

review: Needs Fixing
Revision history for this message
Richard Harding (rharding) :
review: Approve
Revision history for this message
Juju Gui Bot (juju-gui-bot) wrote :

FAILED: Autolanding.
No commit message was specified in the merge proposal. Hit 'Add commit message' on the merge proposal web page or follow the link below. You can approve the merge proposal yourself to rerun.
https://code.launchpad.net/~reedobrien/charmworld/ngrams/+merge/218821/+edit-commit-message

review: Needs Fixing (continuous-integration)
Revision history for this message
Richard Harding (rharding) wrote :

reapprove

review: Approve
Revision history for this message
Juju Gui Bot (juju-gui-bot) :
review: Approve (continuous-integration)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'Makefile'
--- Makefile 2014-02-10 21:27:57 +0000
+++ Makefile 2014-05-14 15:51:33 +0000
@@ -12,7 +12,7 @@
12PSERVE := bin/pserve12PSERVE := bin/pserve
13SYSTEM_DEPS := \13SYSTEM_DEPS := \
14 build-essential bzr libyaml-dev python-dev python-virtualenv\14 build-essential bzr libyaml-dev python-dev python-virtualenv\
15 subversion nodejs npm15 nodejs npm
1616
1717
18# Default ini file to use for all scripted commands. Can be overridden by18# Default ini file to use for all scripted commands. Can be overridden by
@@ -64,7 +64,7 @@
64sysdeps:64sysdeps:
65 sudo apt-get update65 sudo apt-get update
66 sudo apt-get install --yes python-software-properties66 sudo apt-get install --yes python-software-properties
67 sudo add-apt-repository -y ppa:juju/pkgs67 sudo add-apt-repository -y ppa:juju/stable
68 sudo add-apt-repository -y ppa:ce-orange-squad/experimental68 sudo add-apt-repository -y ppa:ce-orange-squad/experimental
69 sudo apt-get update69 sudo apt-get update
70 sudo apt-get install -y $(SYSTEM_DEPS) mongodb-server \70 sudo apt-get install -y $(SYSTEM_DEPS) mongodb-server \
7171
=== modified file 'charmworld/search.py'
--- charmworld/search.py 2014-04-14 18:09:22 +0000
+++ charmworld/search.py 2014-05-14 15:51:33 +0000
@@ -32,6 +32,7 @@
32]32]
33charm_free_text_fields = {33charm_free_text_fields = {
34 'name': 10,34 'name': 10,
35 'ngrams': None,
35 'summary': 5,36 'summary': 5,
36 'description': 3,37 'description': 3,
37 'config.options.description': None,38 'config.options.description': None,
@@ -45,6 +46,7 @@
45]46]
46bundle_free_text_fields = {47bundle_free_text_fields = {
47 'name': 10,48 'name': 10,
49 'ngrams': None,
48 'basket_name': 5,50 'basket_name': 5,
49 'description': 3,51 'description': 3,
50 'title': None,52 'title': None,
@@ -160,6 +162,62 @@
160 "index": {162 "index": {
161 "number_of_shards": settings['es_shards'],163 "number_of_shards": settings['es_shards'],
162 "number_of_replicas": settings['es_replicas'],164 "number_of_replicas": settings['es_replicas'],
165 "analysis": {
166 "filter": {
167 "n3_20grams_filter": {
168 "type": "ngram",
169 "min_gram": 3,
170 "max_gram": 20
171 }
172 },
173 "analyzer": {
174 "n3_20grams": {
175 "type": "custom",
176 "tokenizer": "standard",
177 "filter": [
178 "lowercase",
179 "n3_20grams_filter"
180 ]
181 }
182 }
183 },
184 "mappings": {
185 "charm": {
186 "properties": {
187 "name": {
188 "type": "multi_field",
189 "fields": {
190 "ngrams": {
191 "type": "string",
192 "analyzer": "n3_20grams"
193 },
194 "name": {
195 "type": "string",
196 "index": "not_analyzed"
197 }
198 }
199 }
200 }
201
202 },
203 "bundle": {
204 "properties": {
205 "name": {
206 "type": "multi_field",
207 "fields": {
208 "ngrams": {
209 "type": "string",
210 "analyzer": "n3_20grams"
211 },
212 "name": {
213 "type": "string",
214 "index": "not_analyzed"
215 }
216 }
217 }
218 }
219 }
220 }
163 }221 }
164 })222 })
165 # The ES server may need some time to actually create the index223 # The ES server may need some time to actually create the index
@@ -221,7 +279,6 @@
221 """279 """
222 charm_exact_index = [280 charm_exact_index = [
223 'categories',281 'categories',
224 'name',
225 'owner',282 'owner',
226 'i_provides',283 'i_provides',
227 'i_requires',284 'i_requires',
@@ -229,16 +286,37 @@
229 'store_url',286 'store_url',
230 '_id',287 '_id',
231 ]288 ]
289
232 inner_charm_properties = dict(290 inner_charm_properties = dict(
233 (name, {'type': 'string', 'index': 'not_analyzed'})291 (name, {'type': 'string', 'index': 'not_analyzed'})
234 for name in charm_exact_index)292 for name in charm_exact_index)
235293
236 CHARM_TEXT_INDEX = (294 charm_n3_20gram_index = [
295 'name'
296 ]
297
298 charm_text_index = (
237 'summary',299 'summary',
238 'description',300 'description',
239 )301 )
302
240 inner_charm_properties.update(dict(303 inner_charm_properties.update(dict(
241 (name, {'type': 'string'}) for name in CHARM_TEXT_INDEX))304 (name, {'type': 'string'}) for name in charm_text_index))
305
306 inner_charm_properties.update(
307 dict((name, {
308 "type": "multi_field",
309 "fields": {
310 "ngrams": {
311 "type": "string",
312 "analyzer": "n3_20grams"
313 },
314 "name": {
315 "type": "string",
316 "index": "not_analyzed"
317 }
318 }
319 }) for name in charm_n3_20gram_index))
242320
243 CHARM_INTEGER_INDEX = (321 CHARM_INTEGER_INDEX = (
244 'downloads',322 'downloads',
@@ -290,19 +368,37 @@
290 }368 }
291369
292 bundle_exact_index = [370 bundle_exact_index = [
293 'name',
294 'owner',371 'owner',
295 'series',372 'series',
296 'title',373 'title',
297 ]374 ]
298375
376 bundle_n3_20gram_index = [
377 "name",
378 ]
379
299 bundle_properties = {380 bundle_properties = {
300 'data': {381 'data': {
301 'properties': # XXX The linter won't let me indent this well.382 'properties': # XXX The linter won't let me indent this well.
302 dict((name, {'type': 'string', 'index': 'not_analyzed'})383 dict((name, {'type': 'string', 'index': 'not_analyzed'})
303 for name in bundle_exact_index)384 for name in bundle_exact_index)
304 }}385 }}
305386
387 bundle_properties['data']['properties'].update(
388 dict((name, {
389 "type": "multi_field",
390 "fields": {
391 "ngrams": {
392 "type": "string",
393 "analyzer": "n3_20grams"
394 },
395 "name": {
396 "type": "string",
397 "index": "not_analyzed"
398 }
399 }
400 }) for name in bundle_n3_20gram_index))
401
306 with translate_error():402 with translate_error():
307 try:403 try:
308 for (name, properties, dynamic) in (404 for (name, properties, dynamic) in (
@@ -407,7 +503,7 @@
407 text = text[:-1]503 text = text[:-1]
408504
409 if autocomplete:505 if autocomplete:
410 return {'prefix': {'data.name': text}}506 return {'match': {'ngrams': text}}
411507
412 charm_fields = [field + ('' if boost is None else '^%d' % boost)508 charm_fields = [field + ('' if boost is None else '^%d' % boost)
413 for field, boost in charm_free_text_fields.items()]509 for field, boost in charm_free_text_fields.items()]
@@ -425,8 +521,8 @@
425 fields.extend(other_fields)521 fields.extend(other_fields)
426 return {522 return {
427 'query_string': {523 'query_string': {
428 'query': text,524 'query': text,
429 'fields': fields,525 'fields': fields,
430 }}526 }}
431527
432 charm_dsl = {'filtered': {528 charm_dsl = {'filtered': {
433529
=== modified file 'charmworld/tests/test_search.py'
--- charmworld/tests/test_search.py 2014-04-02 14:12:22 +0000
+++ charmworld/tests/test_search.py 2014-05-14 15:51:33 +0000
@@ -8,6 +8,8 @@
8from textwrap import dedent8from textwrap import dedent
9from time import sleep9from time import sleep
1010
11import unittest
12
11from pyelasticsearch import ElasticSearch13from pyelasticsearch import ElasticSearch
12from pyelasticsearch.exceptions import (14from pyelasticsearch.exceptions import (
13 ElasticHttpError,15 ElasticHttpError,
@@ -154,6 +156,46 @@
154 # it's present.156 # it's present.
155 self.assertItemsEqual(['data', 'weight'], results[0].keys())157 self.assertItemsEqual(['data', 'weight'], results[0].keys())
156158
159 def test_ngrams_found(self):
160 # This tests that the underlying client search works
161 # not that the charmworld search API does _client != client.
162 client = self.index_client
163 test_value = "blarglefoafblatherinfinatum"
164 Charm(self.makeCharm(name=test_value))
165 query = {"query": {"match": {"ngrams": "bla"}}}
166 start, mingram = 0, 3
167 while start < len(test_value) - 3:
168 while mingram < len(test_value):
169 query['query']['match']['ngrams'] = test_value[start:mingram]
170 results = client._client.search(query, index=client.index_name)
171 self.assertEquals(results['hits']['total'], 1)
172 mingram += 1
173 start += 1
174
175 def test_ngrams_too_short(self):
176 client = self.index_client
177 test_value = "blarg"
178 Charm(self.makeCharm(name=test_value))
179 query = {"query": {"match": {"ngrams": "bl"}}}
180 start, mingram = 0, 2
181 while start < len(test_value) - 3:
182 while mingram < len(test_value):
183 query['query']['match']['ngrams'] = test_value[start:mingram]
184 results = client._client.search(query, index=client.index_name)
185 self.assertEquals(results['hits']['total'], 0)
186 mingram += 1
187 start += 1
188
189 def test_ngrams_no_matching_ngrams(self):
190 client = self.index_client
191 test_value = "blarglefoafblatherinfinatum"
192 foo_charm = Charm(self.makeCharm(name=test_value))
193 foo_charm = foo_charm # shut it pyflakes
194 query = {"query": {"match":
195 {"ngrams": "blvlsdkjfa;lsdkghoifdhvsoidufhvsdececewv"}}}
196 results = client._client.search(query, index=client.index_name)
197 self.assertEquals(results['hits']['total'], 0)
198
157 def test_search_bundle(self):199 def test_search_bundle(self):
158 bundle = Bundle(self.makeBundle())200 bundle = Bundle(self.makeBundle())
159 client = self.index_client201 client = self.index_client
@@ -516,14 +558,26 @@
516 ids = [r['data']['_id'] for r in result]558 ids = [r['data']['_id'] for r in result]
517 self.assertEqual([matching_charm['_id']], ids)559 self.assertEqual([matching_charm['_id']], ids)
518560
519 def test_autocomplete_matches_on_prefix(self):561 def test_autocomplete_matches_on_ngram(self):
520 charm1 = self.makeCharm(name='foo')562 charm1 = self.makeCharm(name='foo')
521 charm2 = self.makeCharm(name='foobar')563 charm2 = self.makeCharm(name='foobar')
564 charm3 = self.makeCharm(name='barfoo')
565 result = self.index_client.api_search(
566 'foo', autocomplete=True, min_score=0)
567 ids = [r['data']['_id'] for r in result]
568 self.assertItemsEqual([
569 charm1['_id'],
570 charm2['_id'],
571 charm3['_id']],
572 ids)
573
574 def test_autocomplete_misses_ngram(self):
575 self.makeCharm(name='foo')
576 self.makeCharm(name='foobar')
522 self.makeCharm(name='barfoo')577 self.makeCharm(name='barfoo')
523 result = self.index_client.api_search(578 result = self.index_client.api_search(
524 'foo', autocomplete=True, min_score=0)579 'bla', autocomplete=True, min_score=0)
525 ids = [r['data']['_id'] for r in result]580 self.assertEquals(result, [])
526 self.assertItemsEqual([charm1['_id'], charm2['_id']], ids)
527581
528 def test_special_characters_return_no_results(self):582 def test_special_characters_return_no_results(self):
529 self.makeCharm(name='foo')583 self.makeCharm(name='foo')
@@ -1104,7 +1158,7 @@
11041158
1105def put_incompatible_mapping(index_client):1159def put_incompatible_mapping(index_client):
1106 put_mapping(index_client, {1160 put_mapping(index_client, {
1107 'name': {'type': 'string', 'index': 'analyzed'},1161 '_id': {'type': 'string', 'index': 'analyzed'},
1108 })1162 })
11091163
11101164
@@ -1131,8 +1185,12 @@
1131 put_incompatible_mapping(index_client)1185 put_incompatible_mapping(index_client)
1132 update(index_client)1186 update(index_client)
1133 mapping = index_client.get_mapping()[CHARM]['properties']['data']1187 mapping = index_client.get_mapping()[CHARM]['properties']['data']
1134 self.assertEqual('not_analyzed',1188 self.assertEqual(
1135 mapping['properties']['name']['index'])1189 'not_analyzed',
1190 mapping['properties']['name']['fields']['name']['index'])
1191 self.assertEqual(
1192 'n3_20grams',
1193 mapping['properties']['name']['fields']['ngrams']['analyzer'])
11361194
1137 def test_update_no_index(self):1195 def test_update_no_index(self):
1138 index_client = ElasticSearchClient.from_settings(1196 index_client = ElasticSearchClient.from_settings(
@@ -1143,8 +1201,12 @@
1143 self.assertEqual([actual_client.index_name],1201 self.assertEqual([actual_client.index_name],
1144 index_client.get_aliased())1202 index_client.get_aliased())
1145 mapping = index_client.get_mapping()[CHARM]['properties']['data']1203 mapping = index_client.get_mapping()[CHARM]['properties']['data']
1146 self.assertEqual('not_analyzed',1204 self.assertEqual(
1147 mapping['properties']['name']['index'])1205 'not_analyzed',
1206 mapping['properties']['name']['fields']['name']['index'])
1207 self.assertEqual(
1208 'n3_20grams',
1209 mapping['properties']['name']['fields']['ngrams']['analyzer'])
11481210
1149 def update_to_static_mapping(self, force_reindex):1211 def update_to_static_mapping(self, force_reindex):
1150 index_client = self.use_index_client(put_mapping=False)1212 index_client = self.use_index_client(put_mapping=False)
@@ -1218,9 +1280,14 @@
1218 new_aliased = reindex(alias)1280 new_aliased = reindex(alias)
1219 self.assertEqual([new_aliased.index_name], alias.get_aliased())1281 self.assertEqual([new_aliased.index_name], alias.get_aliased())
1220 mapping = alias.get_mapping()[CHARM]['properties']['data']1282 mapping = alias.get_mapping()[CHARM]['properties']['data']
1221 self.assertEqual('not_analyzed',1283 self.assertEqual(
1222 mapping['properties']['name']['index'])1284 'n3_20grams',
1285 mapping['properties']['name']['fields']['ngrams']['analyzer'])
1286 self.assertEqual(
1287 'not_analyzed',
1288 mapping['properties']['name']['fields']['name']['index'])
12231289
1290 @unittest.skip("XXX: Bug #1317567 - http://bit.ly/1lsXCfv")
1224 def test_reindexed_no_client_charms(self):1291 def test_reindexed_no_client_charms(self):
1225 client = ElasticSearchClient.from_settings(get_ini())1292 client = ElasticSearchClient.from_settings(get_ini())
1226 # This should not raise an exception, even though the index does not1293 # This should not raise an exception, even though the index does not

Subscribers

People subscribed via source and target branches

to all changes: