Merge lp:~jameinel/u1db/index-transformations into lp:u1db
- index-transformations
- Merge into trunk
Status: | Merged |
---|---|
Merged at revision: | 144 |
Proposed branch: | lp:~jameinel/u1db/index-transformations |
Merge into: | lp:u1db |
Diff against target: |
922 lines (+679/-51) 12 files modified
.bzrignore (+1/-0) .testr.conf (+4/-0) doc/sqlite_schema.txt (+0/-1) u1db/backends/inmemory.py (+29/-21) u1db/backends/sqlite_backend.py (+51/-23) u1db/errors.py (+4/-0) u1db/query_parser.py (+235/-0) u1db/tests/__init__.py (+0/-1) u1db/tests/test_backends.py (+57/-0) u1db/tests/test_inmemory.py (+4/-5) u1db/tests/test_query_parser.py (+277/-0) u1db/tests/test_sqlite_backend.py (+17/-0) |
To merge this branch: | bzr merge lp:~jameinel/u1db/index-transformations |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Samuele Pedroni | Approve | ||
Review via email: mp+84250@code.launchpad.net |
Commit message
Description of the change
This takes James Westby's great work on index transformations and updates it a bit. Some thoughts:
1) This changes the Getter api to always think in terms of lists, rather than sometimes direct values and sometimes lists. It cleans up some of the internal code that had to check whether a thing was a list or not, and then apply the operation to each item vs. to the single value.
I think it will also match a C api better, since you don't have object types there, so you just always end up with a list. (I'm still not sure what to do about ints vs strings, but I'm not worrying too much about that yet.)
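For instance, with every Getter returning a list, multi-field index evaluation becomes a cross product of per-field key lists. This is a simplified stand-alone mirror of the InMemoryIndex.evaluate code in the diff below:

```python
def evaluate(getter_results):
    """Combine per-field key lists into composite index keys.

    getter_results is one list of keys per indexed field (each
    Getter.get() now always returns a list). Returns the cross
    product of those lists, joined with '\x01', or [] if any
    field produced no keys.
    """
    all_rows = [[]]
    for keys in getter_results:
        if not keys:
            # A field with no keys means the doc is not indexed.
            return []
        # Extend every partial row with every key for this field.
        all_rows = [row + [key] for key in keys for row in all_rows]
    return ['\x01'.join(row) for row in all_rows]

evaluate([['value']])              # ['value']
evaluate([['value'], ['value2']])  # ['value\x01value2']
evaluate([['a', 'b'], ['x']])      # ['a\x01x', 'b\x01x']
evaluate([['a'], []])              # []
```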
2) The only one I'm not very sure about is IsNull. At the moment doing something like:
create_
Will always return a single-element list. So if your document is:
'{"field": "value"}' => [False]
'{}' => [True]
'{"field": ["list", "values"]}' => [False]
'{"field": null}' => [True]
'{"field": []}' => [True]
'{"field": [null]}' => ??? I think [True]
'{"field": [1, null]}' => [False]
In James's original implementation the only one that was different was the empty list, which would claim that an empty list is not null.
I think this is what we want, though.
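A minimal sketch of these is_null semantics, using simplified stand-ins for the branch's ExtractField getter and IsNull transformation (only the uncontroversial cases from the table above are shown):

```python
def extract_field(raw_doc, field):
    """Simplified mirror of ExtractField.get from this branch."""
    for subfield in field.split('.'):
        if isinstance(raw_doc, dict):
            raw_doc = raw_doc.get(subfield)
        else:
            return []
    if isinstance(raw_doc, dict) or raw_doc is None:
        return []
    if isinstance(raw_doc, list):
        # Keep only simple types (not dicts or lists).
        return [v for v in raw_doc if not isinstance(v, (dict, list))]
    return [raw_doc]

def is_null(raw_doc, field):
    # IsNull.transform: a single-element list saying whether the
    # extracted value list was empty.
    return [len(extract_field(raw_doc, field)) == 0]

is_null({'field': None}, 'field')              # [True]
is_null({'field': []}, 'field')                # [True]
is_null({'field': [1, None]}, 'field')         # [False]
is_null({'field': ['list', 'values']}, 'field')  # [False]
```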
3) We should probably move how the indexing code is tested into being a backend permutation test. At least, we should add more than the small handful of tests we have now.
Samuele Pedroni (pedronis) wrote:
John A Meinel (jameinel) wrote:
On 12/2/2011 3:55 PM, Samuele Pedroni wrote:
> I suppose the parser is written in a style that is easy to
> translate to C, but then I would really use some kind of caching of
> parser->getter outcomes in SQLitePartialExpandDatabase; otherwise I
> would expect inserting data to be (quite?) slower than it was before
> these changes.
>
Yeah, I'm working on actually benchmarking it to make sure it matters.
It shouldn't be hard to just have a:
parser = query_parser.Parser()
parsers = []
for field in fields:
    parsers.append(parser.parse(field))
for content in docs:
    raw_doc = loads(content)
    for getter in parsers:
        rows = getter.get(raw_doc)
> 69 +        if keys is None:
> 70 +            return None
>
> that should just be if key: in the new world, the code doesn't
> break, but we don't bail out early anymore
Sure, good catch.
>
> 271 +        elif isinstance(raw_doc, list):
> 272 +            # If anything in the list is not a simple type, the list is
> 273 +            result = [val for val in raw_doc
>
> I don't understand the comment there, is it truncated?
>
yeah, it got truncated. Originally there was a loop that if *any* item
in the list was not a simple value, then it would treat the whole list
as None. I changed it to just omit the non-simple types.
>
> is it intentional to check in .testr.conf?
This is something James did. But yes. It tells the 'testr' (test
repository?) program how to run your test suite. I'm fine including it.
John
=:->
- 144. By John A Meinel: Use a for loop instead of while + counters to do _take_word
- 145. By John A Meinel: Pre-parse the index definitions before we evaluate them on documents.
- 146. By John A Meinel: Add some doc strings.
John A Meinel (jameinel) wrote:
This has now been updated to pre-parse the index definitions, and then apply the getters to the documents. One wrinkle: current trunk creates a PRIMARY KEY index on document_fields, and this patch removes it. These tests were run with that index removed.
Here are the benchmark results:
2.530s create_index(title) trunk
3.357s create_index(title) no-caching
2.588s create_index(title) caching
The other thing to note is the effect of creating indexes with more complex definitions. With caching:
2.588s create title
2.859s create low_title
3.879s create low_split_title
3.540s create low_low_title
vs without caching:
3.357s create title
3.735s create low_title
5.121s create low_split_title
5.329s create low_low_title
Note that some of the slowdown is because the document_fields table is getting bigger. Also, we should note that the create_index time with *no* new fields inserted is:
1.925s create low_title (trunk)
(that is because trunk doesn't support lower() as an operator yet, so it gives a good baseline of functionality.)
I was a bit surprised that it was that slow, given the speeds of _iter_all_docs and simplejson.loads.
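The more complex definitions above compose transformations (presumably low_split_title means lower(split_words(title)), and so on; the exact benchmark definitions aren't shown here). A rough sketch of how such a composed getter evaluates, using simplified stand-ins for the branch's SplitWords and Lower transformations:

```python
def split_words(values):
    # SplitWords: split each string on whitespace, de-duplicating
    # words (mirrors the quadratic membership check in the patch).
    result = []
    for v in values:
        if isinstance(v, str):
            for word in v.split():
                if word not in result:
                    result.append(word)
    return result

def lower(values):
    # Lower: lowercase strings, dropping anything non-string.
    return [v.lower() for v in values if isinstance(v, str)]

# lower(split_words(title)) applied to one document's field values:
lower(split_words(['The Quick Brown Fox']))
# ['the', 'quick', 'brown', 'fox']
```

Each extra layer is another pass over the value list per document, which is consistent with the deeper definitions costing more in the timings above.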
- 147. By John A Meinel: Fix the shortcut code.
Samuele Pedroni (pedronis) wrote:
Looks good. I would probably be even more aggressive and do this, to amortize parsing the getters across put_docs as well:
=== modified file 'u1db/backends/sqlite_backend.py'
--- u1db/backends/sqlite_backend.py
+++ u1db/backends/sqlite_backend.py
@@ -39,6 +39,8 @@
+        self._parser = query_parser.Parser()
+        self._cached_getters = {}

     def get_sync_target(self):
         return SQLiteSyncTarget(self)
@@ -199,8 +201,11 @@
     def _parse_index_definition(self, index_field):
         """Parse a field definition for an index, returning a Getter."""
-        parser = query_parser.Parser()
-        getter = parser.parse(index_field)
+        try:
+            getter = self._cached_getters[index_field]
+        except KeyError:
+            getter = self._parser.parse(index_field)
+            self._cached_getters[index_field] = getter
         return getter

     def _update_indexes(self, doc_id, raw_doc, getters, db_cursor):
- 148. By John A Meinel: Fix up an outdated comment.
- 149. By John A Meinel: Add a comment about longer-lived Getter caching.
Preview Diff
1 | === modified file '.bzrignore' |
2 | --- .bzrignore 2011-12-02 19:05:35 +0000 |
3 | +++ .bzrignore 2011-12-06 15:24:25 +0000 |
4 | @@ -2,3 +2,4 @@ |
5 | ./dist |
6 | ./u1db.egg-info |
7 | doc/sqlite_schema.html |
8 | +.testrepository |
9 | |
10 | === added file '.testr.conf' |
11 | --- .testr.conf 1970-01-01 00:00:00 +0000 |
12 | +++ .testr.conf 2011-12-06 15:24:25 +0000 |
13 | @@ -0,0 +1,4 @@ |
14 | +[DEFAULT] |
15 | +test_command=${PYTHON:-python} -m subunit.run $LISTOPT $IDOPTION discover u1db |
16 | +test_id_option=--load-list $IDFILE |
17 | +test_list_option=--list |
18 | |
19 | === modified file 'doc/sqlite_schema.txt' |
20 | --- doc/sqlite_schema.txt 2011-11-16 08:03:06 +0000 |
21 | +++ doc/sqlite_schema.txt 2011-12-06 15:24:25 +0000 |
22 | @@ -38,7 +38,6 @@ |
23 | doc_id TEXT, |
24 | field_name TEXT, |
25 | value TEXT, |
26 | - CONSTRAINT document_fields_pkey PRIMARY KEY (doc_id, field_name) |
27 | ); |
28 | |
29 | So if you had two documents of the form:: |
30 | |
31 | === modified file 'u1db/backends/inmemory.py' |
32 | --- u1db/backends/inmemory.py 2011-12-02 11:02:42 +0000 |
33 | +++ u1db/backends/inmemory.py 2011-12-06 15:24:25 +0000 |
34 | @@ -16,7 +16,11 @@ |
35 | |
36 | import simplejson |
37 | |
38 | -from u1db import Document, errors |
39 | +from u1db import ( |
40 | + Document, |
41 | + errors, |
42 | + query_parser, |
43 | + ) |
44 | from u1db.backends import CommonBackend, CommonSyncTarget |
45 | |
46 | |
47 | @@ -182,6 +186,8 @@ |
48 | self._name = index_name |
49 | self._definition = index_definition |
50 | self._values = {} |
51 | + parser = query_parser.Parser() |
52 | + self._getters = parser.parse_all(self._definition) |
53 | |
54 | def evaluate_json(self, doc): |
55 | """Determine the 'key' after applying this index to the doc.""" |
56 | @@ -190,32 +196,35 @@ |
57 | |
58 | def evaluate(self, obj): |
59 | """Evaluate a dict object, applying this definition.""" |
60 | - result = [] |
61 | - for field in self._definition: |
62 | - val = obj |
63 | - for subfield in field.split('.'): |
64 | - val = val.get(subfield) |
65 | - if val is None: |
66 | - return None |
67 | - result.append(val) |
68 | - return '\x01'.join(result) |
69 | + all_rows = [[]] |
70 | + for getter in self._getters: |
71 | + new_rows = [] |
72 | + keys = getter.get(obj) |
73 | + if not keys: |
74 | + return [] |
75 | + for key in keys: |
76 | + new_rows.extend([row + [key] for row in all_rows]) |
77 | + all_rows = new_rows |
78 | + all_rows = ['\x01'.join(row) for row in all_rows] |
79 | + return all_rows |
80 | |
81 | def add_json(self, doc_id, doc): |
82 | """Add this json doc to the index.""" |
83 | - key = self.evaluate_json(doc) |
84 | - if key is None: |
85 | + keys = self.evaluate_json(doc) |
86 | + if not keys: |
87 | return |
88 | - self._values.setdefault(key, []).append(doc_id) |
89 | + for key in keys: |
90 | + self._values.setdefault(key, []).append(doc_id) |
91 | |
92 | def remove_json(self, doc_id, doc): |
93 | """Remove this json doc from the index.""" |
94 | - key = self.evaluate_json(doc) |
95 | - if key is None: |
96 | - return |
97 | - doc_ids = self._values[key] |
98 | - doc_ids.remove(doc_id) |
99 | - if not doc_ids: |
100 | - del self._values[key] |
101 | + keys = self.evaluate_json(doc) |
102 | + if keys: |
103 | + for key in keys: |
104 | + doc_ids = self._values[key] |
105 | + doc_ids.remove(doc_id) |
106 | + if not doc_ids: |
107 | + del self._values[key] |
108 | |
109 | def _find_non_wildcards(self, values): |
110 | """Check if this should be a wildcard match. |
111 | @@ -288,4 +297,3 @@ |
112 | def record_sync_info(self, other_replica_uid, other_replica_generation): |
113 | self._db.set_sync_generation(other_replica_uid, |
114 | other_replica_generation) |
115 | - |
116 | |
117 | === modified file 'u1db/backends/sqlite_backend.py' |
118 | --- u1db/backends/sqlite_backend.py 2011-12-02 11:02:42 +0000 |
119 | +++ u1db/backends/sqlite_backend.py 2011-12-06 15:24:25 +0000 |
120 | @@ -21,7 +21,12 @@ |
121 | import uuid |
122 | |
123 | from u1db.backends import CommonBackend, CommonSyncTarget |
124 | -from u1db import Document, errors |
125 | +from u1db import ( |
126 | + compat, |
127 | + Document, |
128 | + errors, |
129 | + query_parser, |
130 | + ) |
131 | |
132 | |
133 | class SQLiteDatabase(CommonBackend): |
134 | @@ -136,9 +141,7 @@ |
135 | c.execute("CREATE TABLE document_fields (" |
136 | " doc_id TEXT," |
137 | " field_name TEXT," |
138 | - " value TEXT," |
139 | - " CONSTRAINT document_fields_pkey" |
140 | - " PRIMARY KEY (doc_id, field_name))") |
141 | + " value TEXT)") |
142 | # TODO: Should we include doc_id or not? By including it, the |
143 | # content can be returned directly from the index, and |
144 | # matched with the documents table, roughly saving 1 btree |
145 | @@ -194,6 +197,36 @@ |
146 | def _extra_schema_init(self, c): |
147 | """Add any extra fields, etc to the basic table definitions.""" |
148 | |
149 | + def _parse_index_definition(self, index_field): |
150 | + """Parse a field definition for an index, returning a Getter.""" |
151 | + # Note: We may want to keep a Parser object around, and cache the |
152 | + # Getter objects for a greater length of time. Specifically, if |
153 | + # you create a bunch of indexes, and then insert 50k docs, you'll |
154 | + # re-parse the indexes between puts. The time to insert the docs |
155 | + # is still likely to dominate put_doc time, though. |
156 | + parser = query_parser.Parser() |
157 | + getter = parser.parse(index_field) |
158 | + return getter |
159 | + |
160 | + def _update_indexes(self, doc_id, raw_doc, getters, db_cursor): |
161 | + """Update document_fields for a single document. |
162 | + |
163 | + :param doc_id: Identifier for this document |
164 | + :param raw_doc: The python dict representation of the document. |
165 | + :param getters: A list of [(field_name, Getter)]. Getter.get will be |
166 | + called to evaluate the index definition for this document, and the |
167 | + results will be inserted into the db. |
168 | + :param db_cursor: An sqlite Cursor. |
169 | + :return: None |
170 | + """ |
171 | + values = [] |
172 | + for field_name, getter in getters: |
173 | + for idx_value in getter.get(raw_doc): |
174 | + values.append((doc_id, field_name, idx_value)) |
175 | + if values: |
176 | + db_cursor.executemany( |
177 | + "INSERT INTO document_fields VALUES (?, ?, ?)", values) |
178 | + |
179 | def _set_replica_uid(self, replica_uid): |
180 | """Force the replica_uid to be set.""" |
181 | with self._db_handle: |
182 | @@ -550,21 +583,9 @@ |
183 | |
184 | def _evaluate_index(self, raw_doc, field): |
185 | val = raw_doc |
186 | - for subfield in field.split('.'): |
187 | - if val is None: |
188 | - return None |
189 | - val = val.get(subfield, None) |
190 | - return val |
191 | - |
192 | - def _update_indexes(self, doc_id, raw_doc, fields, db_cursor): |
193 | - values = [] |
194 | - for field_name in fields: |
195 | - idx_value = self._evaluate_index(raw_doc, field_name) |
196 | - if idx_value is not None: |
197 | - values.append((doc_id, field_name, idx_value)) |
198 | - if values: |
199 | - db_cursor.executemany( |
200 | - "INSERT INTO document_fields VALUES (?, ?, ?)", values) |
201 | + parser = query_parser.Parser() |
202 | + getter = parser.parse(field) |
203 | + return getter.get(raw_doc) |
204 | |
205 | def _put_and_update_indexes(self, old_doc, doc): |
206 | c = self._db_handle.cursor() |
207 | @@ -584,8 +605,9 @@ |
208 | if indexed_fields: |
209 | # It is expected that len(indexed_fields) is shorter than |
210 | # len(raw_doc) |
211 | - # TODO: Handle nested indexed fields. |
212 | - self._update_indexes(doc.doc_id, raw_doc, indexed_fields, c) |
213 | + getters = [(field, self._parse_index_definition(field)) |
214 | + for field in indexed_fields] |
215 | + self._update_indexes(doc.doc_id, raw_doc, getters, c) |
216 | c.execute("INSERT INTO transaction_log(doc_id) VALUES (?)", |
217 | (doc.doc_id,)) |
218 | |
219 | @@ -613,9 +635,15 @@ |
220 | yield row |
221 | |
222 | def _update_all_indexes(self, new_fields): |
223 | + """Iterate all the documents, and add content to document_fields. |
224 | + |
225 | + :param new_fields: The index definitions that need to be added. |
226 | + """ |
227 | + getters = [(field, self._parse_index_definition(field)) |
228 | + for field in new_fields] |
229 | + c = self._db_handle.cursor() |
230 | for doc_id, doc in self._iter_all_docs(): |
231 | raw_doc = simplejson.loads(doc) |
232 | - c = self._db_handle.cursor() |
233 | - self._update_indexes(doc_id, raw_doc, new_fields, c) |
234 | + self._update_indexes(doc_id, raw_doc, getters, c) |
235 | |
236 | SQLiteDatabase.register_implementation(SQLitePartialExpandDatabase) |
237 | |
238 | === modified file 'u1db/errors.py' |
239 | --- u1db/errors.py 2011-12-01 14:46:15 +0000 |
240 | +++ u1db/errors.py 2011-12-06 15:24:25 +0000 |
241 | @@ -65,6 +65,10 @@ |
242 | wire_description = "database does not exist" |
243 | |
244 | |
245 | +class IndexDefinitionParseError(U1DBError): |
246 | + """The index definition cannot be parsed.""" |
247 | + |
248 | + |
249 | class HTTPError(U1DBError): |
250 | """Unspecific HTTP errror.""" |
251 | |
252 | |
253 | === added file 'u1db/query_parser.py' |
254 | --- u1db/query_parser.py 1970-01-01 00:00:00 +0000 |
255 | +++ u1db/query_parser.py 2011-12-06 15:24:25 +0000 |
256 | @@ -0,0 +1,235 @@ |
257 | +# Copyright 2011 Canonical Ltd. |
258 | +# |
259 | +# This program is free software: you can redistribute it and/or modify it |
260 | +# under the terms of the GNU General Public License version 3, as published |
261 | +# by the Free Software Foundation. |
262 | +# |
263 | +# This program is distributed in the hope that it will be useful, but |
264 | +# WITHOUT ANY WARRANTY; without even the implied warranties of |
265 | +# MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR |
266 | +# PURPOSE. See the GNU General Public License for more details. |
267 | +# |
268 | +# You should have received a copy of the GNU General Public License along |
269 | +# with this program. If not, see <http://www.gnu.org/licenses/>. |
270 | + |
271 | +"""Code for parsing Index definitions.""" |
272 | + |
273 | +import string |
274 | + |
275 | +from u1db import ( |
276 | + errors, |
277 | + ) |
278 | + |
279 | + |
280 | +class Getter(object): |
281 | + """Get values from a document based on a specification.""" |
282 | + |
283 | + def get(self, raw_doc): |
284 | + """Get a value from the document. |
285 | + |
286 | + :param raw_doc: a python dictionary to get the value from. |
287 | + :return: A list of values that match the description. |
288 | + """ |
289 | + raise NotImplementedError(self.get) |
290 | + |
291 | + |
292 | +class StaticGetter(Getter): |
293 | + """A getter that returns a defined value (independent of the doc).""" |
294 | + |
295 | + def __init__(self, value): |
296 | + """Create a StaticGetter. |
297 | + |
298 | + :param value: the value to return when get is called. |
299 | + """ |
300 | + if value is None: |
301 | + self.value = [] |
302 | + elif isinstance(value, list): |
303 | + self.value = value |
304 | + else: |
305 | + self.value = [value] |
306 | + |
307 | + def get(self, raw_doc): |
308 | + return self.value |
309 | + |
310 | + |
311 | +class ExtractField(Getter): |
312 | + """Extract a field from the document.""" |
313 | + |
314 | + def __init__(self, field): |
315 | + """Create an ExtractField object. |
316 | + |
317 | + When a document is passed to get() this will return a value |
318 | + from the document based on the field specifier passed to |
319 | + the constructor. |
320 | + |
321 | + None will be returned if the field is nonexistant, or refers to an |
322 | + object, rather than a simple type or list of simple types. |
323 | + |
324 | + :param field: a specifier for the field to return. |
325 | + This is either a field name, or a dotted field name. |
326 | + """ |
327 | + self.field = field |
328 | + |
329 | + def get(self, raw_doc): |
330 | + for subfield in self.field.split('.'): |
331 | + if isinstance(raw_doc, dict): |
332 | + raw_doc = raw_doc.get(subfield) |
333 | + else: |
334 | + return [] |
335 | + if isinstance(raw_doc, dict): |
336 | + return [] |
337 | + if raw_doc is None: |
338 | + result = [] |
339 | + elif isinstance(raw_doc, list): |
340 | + # Strip anything in the list that isn't a simple type |
341 | + result = [val for val in raw_doc |
342 | + if not isinstance(val, (dict, list))] |
343 | + else: |
344 | + result = [raw_doc] |
345 | + return result |
346 | + |
347 | + |
348 | +class Transformation(Getter): |
349 | + """A transformation on a value from another Getter.""" |
350 | + |
351 | + name = None |
352 | + """The name that the transform has in a query string.""" |
353 | + |
354 | + def __init__(self, inner): |
355 | + """Create a transformation. |
356 | + |
357 | + :param inner: the Getter to transform the value for. |
358 | + """ |
359 | + self.inner = inner |
360 | + |
361 | + def get(self, raw_doc): |
362 | + inner_values = self.inner.get(raw_doc) |
363 | + assert isinstance(inner_values, list), 'get() should always return a list' |
364 | + return self.transform(inner_values) |
365 | + |
366 | + def transform(self, values): |
367 | + """Transform the values. |
368 | + |
369 | + This should be implemented by subclasses to transform the |
370 | + value when get() is called. |
371 | + |
372 | + :param values: the values from the other Getter |
373 | + :return: the transformed values. |
374 | + """ |
375 | + raise NotImplementedError(self.transform) |
376 | + |
377 | + |
378 | +class Lower(Transformation): |
379 | + """Lowercase a string. |
380 | + |
381 | + This transformation will return None for non-string inputs. However, |
382 | + it will lowercase any strings in a list, dropping any elements |
383 | + that are not strings. |
384 | + """ |
385 | + |
386 | + name = "lower" |
387 | + |
388 | + def _can_transform(self, val): |
389 | + return not isinstance(val, (int, bool, float, list, dict)) |
390 | + |
391 | + def transform(self, values): |
392 | + if not values: |
393 | + return [] |
394 | + return [val.lower() for val in values if self._can_transform(val)] |
395 | + |
396 | + |
397 | +class SplitWords(Transformation): |
398 | + """Split a string on whitespace. |
399 | + |
400 | + This Getter will return [] for non-string inputs. It will however |
401 | + split any strings in an input list, discarding any elements that |
402 | + are not strings. |
403 | + """ |
404 | + |
405 | + name = "split_words" |
406 | + |
407 | + def _can_transform(self, val): |
408 | + return not isinstance(val, (int, bool, float, list, dict)) |
409 | + |
410 | + def transform(self, values): |
411 | + if not values: |
412 | + return [] |
413 | + result = [] |
414 | + for value in values: |
415 | + if self._can_transform(value): |
416 | + # TODO: This is quadratic to search the list linearly while we |
417 | + # are appending to it. Consider using a set() instead. |
418 | + for word in value.split(): |
419 | + if word not in result: |
420 | + result.append(word) |
421 | + return result |
422 | + |
423 | + |
424 | +class IsNull(Transformation): |
425 | + """Indicate whether the input is None. |
426 | + |
427 | + This Getter returns a bool indicating whether the input is nil. |
428 | + """ |
429 | + |
430 | + name = "is_null" |
431 | + |
432 | + def transform(self, values): |
433 | + return [len(values) == 0] |
434 | + |
435 | + |
436 | +class Parser(object): |
437 | + """Parse an index expression into a sequence of transformations.""" |
438 | + |
439 | + _transformations = {} |
440 | + _word_chars = string.lowercase + string.uppercase + "._" + string.digits |
441 | + |
442 | + def _take_word(self, partial): |
443 | + word = '' |
444 | + for idx, char in enumerate(partial): |
445 | + if char not in self._word_chars: |
446 | + return partial[:idx], partial[idx:] |
447 | + return partial, '' |
448 | + |
449 | + def parse(self, field): |
450 | + inner = self._inner_parse(field) |
451 | + return inner |
452 | + |
453 | + def _inner_parse(self, field): |
454 | + word, field = self._take_word(field) |
455 | + if field.startswith("("): |
456 | + # We have an operation |
457 | + if not field.endswith(")"): |
458 | + raise errors.IndexDefinitionParseError( |
459 | + "Invalid transformation function: %s" % field) |
460 | + op = self._transformations.get(word, None) |
461 | + if op is None: |
462 | + raise errors.IndexDefinitionParseError( |
463 | + "Unknown operation: %s" % word) |
464 | + inner = self._inner_parse(field[1:-1]) |
465 | + return op(inner) |
466 | + else: |
467 | + if len(field) != 0: |
468 | + raise errors.IndexDefinitionParseError( |
469 | + "Unhandled characters: %s" % (field,)) |
470 | + if len(word) == 0: |
471 | + raise errors.IndexDefinitionParseError( |
472 | + "Missing field specifier") |
473 | + if word.endswith("."): |
474 | + raise errors.IndexDefinitionParseError( |
475 | + "Invalid field specifier: %s" % word) |
476 | + return ExtractField(word) |
477 | + |
478 | + def parse_all(self, fields): |
479 | + return [self.parse(field) for field in fields] |
480 | + |
481 | + @classmethod |
482 | + def register_transormation(cls, transform): |
483 | + assert transform.name not in cls._transformations, ( |
484 | + "Transform %s already registered for %s" |
485 | + % (transform.name, cls._transformations[transform.name])) |
486 | + cls._transformations[transform.name] = transform |
487 | + |
488 | + |
489 | +Parser.register_transormation(SplitWords) |
490 | +Parser.register_transormation(Lower) |
491 | +Parser.register_transormation(IsNull) |
492 | |
493 | === modified file 'u1db/tests/__init__.py' |
494 | --- u1db/tests/__init__.py 2011-11-28 19:51:23 +0000 |
495 | +++ u1db/tests/__init__.py 2011-12-06 15:24:25 +0000 |
496 | @@ -107,7 +107,6 @@ |
497 | |
498 | class DatabaseBaseTests(TestCase): |
499 | |
500 | - create_database = None |
501 | scenarios = LOCAL_DATABASES_SCENARIOS |
502 | |
503 | def create_database(self, replica_uid): |
504 | |
505 | === modified file 'u1db/tests/test_backends.py' |
506 | --- u1db/tests/test_backends.py 2011-12-01 14:46:15 +0000 |
507 | +++ u1db/tests/test_backends.py 2011-12-06 15:24:25 +0000 |
508 | @@ -518,6 +518,63 @@ |
509 | self.assertEqual([doc1], |
510 | self.db.get_from_index('test-idx', [('*',)])) |
511 | |
512 | + def test_get_from_index_with_lower(self): |
513 | + self.db.create_index("index", ["lower(name)"]) |
514 | + content = '{"name": "Foo"}' |
515 | + doc = self.db.create_doc(content) |
516 | + rows = self.db.get_from_index("index", [("foo", )]) |
517 | + self.assertEqual([doc], rows) |
518 | + |
519 | + def test_get_from_index_with_lower_matches_same_case(self): |
520 | + self.db.create_index("index", ["lower(name)"]) |
521 | + content = '{"name": "foo"}' |
522 | + doc = self.db.create_doc(content) |
523 | + rows = self.db.get_from_index("index", [("foo", )]) |
524 | + self.assertEqual([doc], rows) |
525 | + |
526 | + def test_index_lower_doesnt_match_different_case(self): |
527 | + self.db.create_index("index", ["lower(name)"]) |
528 | + content = '{"name": "Foo"}' |
529 | + doc = self.db.create_doc(content) |
530 | + rows = self.db.get_from_index("index", [("Foo", )]) |
531 | + self.assertEqual([], rows) |
532 | + |
533 | + def test_index_lower_doesnt_match_other_index(self): |
534 | + self.db.create_index("index", ["lower(name)"]) |
535 | + self.db.create_index("other_index", ["name"]) |
536 | + content = '{"name": "Foo"}' |
537 | + doc = self.db.create_doc(content) |
538 | + rows = self.db.get_from_index("index", [("Foo", )]) |
539 | + self.assertEqual(0, len(rows)) |
540 | + |
541 | + def test_index_list(self): |
542 | + self.db.create_index("index", ["name"]) |
543 | + content = '{"name": ["foo", "bar"]}' |
544 | + doc = self.db.create_doc(content) |
545 | + rows = self.db.get_from_index("index", [("bar", )]) |
546 | + self.assertEqual([doc], rows) |
547 | + |
548 | + def test_index_split_words_match_first(self): |
549 | + self.db.create_index("index", ["split_words(name)"]) |
550 | + content = '{"name": "foo bar"}' |
551 | + doc = self.db.create_doc(content) |
552 | + rows = self.db.get_from_index("index", [("foo", )]) |
553 | + self.assertEqual([doc], rows) |
554 | + |
555 | + def test_index_split_words_match_second(self): |
556 | + self.db.create_index("index", ["split_words(name)"]) |
557 | + content = '{"name": "foo bar"}' |
558 | + doc = self.db.create_doc(content) |
559 | + rows = self.db.get_from_index("index", [("bar", )]) |
560 | + self.assertEqual([doc], rows) |
561 | + |
562 | + def test_index_split_words_match_both(self): |
563 | + self.db.create_index("index", ["split_words(name)"]) |
564 | + content = '{"name": "foo foo"}' |
565 | + doc = self.db.create_doc(content) |
566 | + rows = self.db.get_from_index("index", [("foo", )]) |
567 | + self.assertEqual([doc], rows) |
568 | + |
569 | def test_get_partial_from_index(self): |
570 | content1 = '{"k1": "v1", "k2": "v2"}' |
571 | content2 = '{"k1": "v1", "k2": "x2"}' |
572 | |
573 | === modified file 'u1db/tests/test_inmemory.py' |
574 | --- u1db/tests/test_inmemory.py 2011-12-02 11:02:42 +0000 |
575 | +++ u1db/tests/test_inmemory.py 2011-12-06 15:24:25 +0000 |
576 | @@ -53,20 +53,20 @@ |
577 | |
578 | def test_evaluate_json(self): |
579 | idx = inmemory.InMemoryIndex('idx-name', ['key']) |
580 | - self.assertEqual('value', idx.evaluate_json(simple_doc)) |
581 | + self.assertEqual(['value'], idx.evaluate_json(simple_doc)) |
582 | |
583 | def test_evaluate_json_field_None(self): |
584 | idx = inmemory.InMemoryIndex('idx-name', ['missing']) |
585 | - self.assertEqual(None, idx.evaluate_json(simple_doc)) |
586 | + self.assertEqual([], idx.evaluate_json(simple_doc)) |
587 | |
588 | def test_evaluate_json_subfield_None(self): |
589 | idx = inmemory.InMemoryIndex('idx-name', ['key', 'missing']) |
590 | - self.assertEqual(None, idx.evaluate_json(simple_doc)) |
591 | + self.assertEqual([], idx.evaluate_json(simple_doc)) |
592 | |
593 | def test_evaluate_multi_index(self): |
594 | doc = '{"key": "value", "key2": "value2"}' |
595 | idx = inmemory.InMemoryIndex('idx-name', ['key', 'key2']) |
596 | - self.assertEqual('value\x01value2', |
597 | + self.assertEqual(['value\x01value2'], |
598 | idx.evaluate_json(doc)) |
599 | |
600 | def test_update_ignores_None(self): |
601 | @@ -119,4 +119,3 @@ |
602 | idx._find_non_wildcards, ('a', 'b', 'c', 'd')) |
603 | self.assertRaises(errors.InvalidValueForIndex, |
604 | idx._find_non_wildcards, ('*', 'b', 'c')) |
605 | - |
606 | |
607 | === added file 'u1db/tests/test_query_parser.py' |
608 | --- u1db/tests/test_query_parser.py 1970-01-01 00:00:00 +0000 |
609 | +++ u1db/tests/test_query_parser.py 2011-12-06 15:24:25 +0000 |
610 | @@ -0,0 +1,277 @@ |
611 | +# Copyright 2011 Canonical Ltd. |
612 | +# |
613 | +# This program is free software: you can redistribute it and/or modify it |
614 | +# under the terms of the GNU General Public License version 3, as published |
615 | +# by the Free Software Foundation. |
616 | +# |
617 | +# This program is distributed in the hope that it will be useful, but |
618 | +# WITHOUT ANY WARRANTY; without even the implied warranties of |
619 | +# MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR |
620 | +# PURPOSE. See the GNU General Public License for more details. |
621 | +# |
622 | +# You should have received a copy of the GNU General Public License along |
623 | +# with this program. If not, see <http://www.gnu.org/licenses/>. |
624 | + |
625 | +from u1db import ( |
626 | + errors, |
627 | + query_parser, |
628 | + tests, |
629 | + ) |
630 | + |
631 | + |
632 | +trivial_raw_doc = {} |
633 | + |
634 | +class TestStaticGetter(tests.TestCase): |
635 | + |
636 | + def test_returns_string(self): |
637 | + getter = query_parser.StaticGetter('foo') |
638 | + self.assertEqual(['foo'], getter.get(trivial_raw_doc)) |
639 | + |
640 | + def test_returns_int(self): |
641 | + getter = query_parser.StaticGetter(9) |
642 | + self.assertEqual([9], getter.get(trivial_raw_doc)) |
643 | + |
644 | + def test_returns_float(self): |
645 | + getter = query_parser.StaticGetter(9.2) |
646 | + self.assertEqual([9.2], getter.get(trivial_raw_doc)) |
647 | + |
648 | + def test_returns_None(self): |
649 | + getter = query_parser.StaticGetter(None) |
650 | + self.assertEqual([], getter.get(trivial_raw_doc)) |
651 | + |
652 | + def test_returns_list(self): |
653 | + getter = query_parser.StaticGetter(['a', 'b']) |
654 | + self.assertEqual(['a', 'b'], getter.get(trivial_raw_doc)) |
655 | + |
656 | + |
657 | +class TestExtractField(tests.TestCase): |
658 | + |
659 | + def assertExtractField(self, expected, field_name, raw_doc): |
660 | + getter = query_parser.ExtractField(field_name) |
661 | + self.assertEqual(expected, getter.get(raw_doc)) |
662 | + |
663 | + def test_get_value(self): |
664 | + self.assertExtractField(['bar'], 'foo', {'foo': 'bar'}) |
665 | + |
666 | + def test_get_value_None(self): |
667 | + self.assertExtractField([], 'foo', {'foo': None}) |
668 | + |
669 | + def test_get_value_missing_key(self): |
670 | + self.assertExtractField([], 'foo', {}) |
671 | + |
672 | + def test_get_value_subfield(self): |
673 | + self.assertExtractField(['bar'], 'foo.baz', {'foo': {'baz': 'bar'}}) |
674 | + |
675 | + def test_get_value_subfield_missing(self): |
676 | + self.assertExtractField([], 'foo.baz', {'foo': 'bar'}) |
677 | + |
678 | + def test_get_value_dict(self): |
679 | + self.assertExtractField([], 'foo', {'foo': {'baz': 'bar'}}) |
680 | + |
681 | + def test_get_value_list(self): |
682 | + self.assertExtractField(['bar', 'zap'], 'foo', {'foo': ['bar', 'zap']}) |
683 | + |
684 | + def test_get_value_mixed_list(self): |
685 | + self.assertExtractField(['bar', 'zap'], 'foo', |
686 | + {'foo': ['bar', ['baa'], 'zap', {'bing': 9}]}) |
687 | + |
688 | + def test_get_value_list_of_dicts(self): |
689 | + self.assertExtractField([], 'foo', {'foo': [{'zap': 'bar'}]}) |
690 | + |
691 | + def test_get_value_int(self): |
692 | + self.assertExtractField([9], 'foo', {'foo': 9}) |
693 | + |
694 | + def test_get_value_float(self): |
695 | + self.assertExtractField([9.2], 'foo', {'foo': 9.2}) |
696 | + |
697 | + def test_get_value_bool(self): |
698 | + self.assertExtractField([True], 'foo', {'foo': True}) |
699 | + self.assertExtractField([False], 'foo', {'foo': False}) |
700 | + |
701 | + |
702 | +class TestLower(tests.TestCase): |
703 | + |
704 | + def assertLowerGets(self, expected, input_val): |
705 | + getter = query_parser.Lower(query_parser.StaticGetter(input_val)) |
706 | + out_val = getter.get(trivial_raw_doc) |
707 | + self.assertEqual(expected, out_val) |
708 | + |
709 | + def test_inner_returns_None(self): |
710 | + self.assertLowerGets([], None) |
711 | + |
712 | + def test_inner_returns_string(self): |
713 | + self.assertLowerGets(['foo'], 'fOo') |
714 | + |
715 | + def test_inner_returns_list(self): |
716 | + self.assertLowerGets(['foo', 'bar'], ['fOo', 'bAr']) |
717 | + |
718 | + def test_inner_returns_int(self): |
719 | + self.assertLowerGets([], 9) |
720 | + |
721 | + def test_inner_returns_float(self): |
722 | + self.assertLowerGets([], 9.0) |
723 | + |
724 | + def test_inner_returns_bool(self): |
725 | + self.assertLowerGets([], True) |
726 | + |
727 | + def test_inner_returns_list_containing_int(self): |
728 | + self.assertLowerGets(['foo', 'bar'], ['fOo', 9, 'bAr']) |
729 | + |
730 | + def test_inner_returns_list_containing_float(self): |
731 | + self.assertLowerGets(['foo', 'bar'], ['fOo', 9.2, 'bAr']) |
732 | + |
733 | + def test_inner_returns_list_containing_bool(self): |
734 | + self.assertLowerGets(['foo', 'bar'], ['fOo', True, 'bAr']) |
735 | + |
736 | + def test_inner_returns_list_containing_list(self): |
737 | + # TODO: Should this be unfolding the inner list? |
738 | + self.assertLowerGets(['foo', 'bar'], ['fOo', ['bAa'], 'bAr']) |
739 | + |
740 | + def test_inner_returns_list_containing_dict(self): |
741 | + self.assertLowerGets(['foo', 'bar'], ['fOo', {'baa': 'xam'}, 'bAr']) |
742 | + |
743 | + |
744 | +class TestSplitWords(tests.TestCase): |
745 | + |
746 | + def assertSplitWords(self, expected, value): |
747 | + getter = query_parser.SplitWords(query_parser.StaticGetter(value)) |
748 | + self.assertEqual(expected, getter.get(trivial_raw_doc)) |
749 | + |
750 | + def test_inner_returns_None(self): |
751 | + self.assertSplitWords([], None) |
752 | + |
753 | + def test_inner_returns_string(self): |
754 | + self.assertSplitWords(['foo', 'bar'], 'foo bar') |
755 | + |
756 | + def test_inner_returns_list(self): |
757 | + self.assertSplitWords(['foo', 'baz', 'bar', 'sux'], |
758 | + ['foo baz', 'bar sux']) |
759 | + |
760 | + def test_deduplicates(self): |
761 | + self.assertSplitWords(['bar'], ['bar', 'bar', 'bar']) |
762 | + |
763 | + def test_inner_returns_int(self): |
764 | + self.assertSplitWords([], 9) |
765 | + |
766 | + def test_inner_returns_float(self): |
767 | + self.assertSplitWords([], 9.2) |
768 | + |
769 | + def test_inner_returns_bool(self): |
770 | + self.assertSplitWords([], True) |
771 | + |
772 | + def test_inner_returns_list_containing_int(self): |
773 | + self.assertSplitWords(['foo', 'baz', 'bar', 'sux'], |
774 | + ['foo baz', 9, 'bar sux']) |
775 | + |
776 | + def test_inner_returns_list_containing_float(self): |
777 | + self.assertSplitWords(['foo', 'baz', 'bar', 'sux'], |
778 | + ['foo baz', 9.2, 'bar sux']) |
779 | + |
780 | + def test_inner_returns_list_containing_bool(self): |
781 | + self.assertSplitWords(['foo', 'baz', 'bar', 'sux'], |
782 | + ['foo baz', True, 'bar sux']) |
783 | + |
784 | + def test_inner_returns_list_containing_list(self): |
785 | + # TODO: Expand sub-lists? |
786 | + self.assertSplitWords(['foo', 'baz', 'bar', 'sux'], |
787 | + ['foo baz', ['baa'], 'bar sux']) |
788 | + |
789 | + def test_inner_returns_list_containing_dict(self): |
790 | + self.assertSplitWords(['foo', 'baz', 'bar', 'sux'], |
791 | + ['foo baz', {'baa': 'xam'}, 'bar sux']) |
792 | + |
793 | + |
794 | +class TestIsNull(tests.TestCase): |
795 | + |
796 | + def assertIsNull(self, value): |
797 | + getter = query_parser.IsNull(query_parser.StaticGetter(value)) |
798 | + self.assertEqual([True], getter.get(trivial_raw_doc)) |
799 | + |
800 | + def assertIsNotNull(self, value): |
801 | + getter = query_parser.IsNull(query_parser.StaticGetter(value)) |
802 | + self.assertEqual([False], getter.get(trivial_raw_doc)) |
803 | + |
804 | + def test_inner_returns_None(self): |
805 | + self.assertIsNull(None) |
806 | + |
807 | + def test_inner_returns_string(self): |
808 | + self.assertIsNotNull('foo') |
809 | + |
810 | + def test_inner_returns_list(self): |
811 | + self.assertIsNotNull(['foo', 'bar']) |
812 | + |
813 | + def test_inner_returns_empty_list(self): |
814 | + # TODO: is this the behavior we want? |
815 | + self.assertIsNull([]) |
816 | + |
817 | + def test_inner_returns_int(self): |
818 | + self.assertIsNotNull(9) |
819 | + |
820 | + def test_inner_returns_float(self): |
821 | + self.assertIsNotNull(9.2) |
822 | + |
823 | + def test_inner_returns_bool(self): |
824 | + self.assertIsNotNull(True) |
825 | + |
826 | + # TODO: What about a dict? Inner is likely to return None, even though the |
827 | + # attribute does exist... |
828 | + |
829 | + |
830 | +class TestParser(tests.TestCase): |
831 | + |
832 | + def parse(self, spec): |
833 | + parser = query_parser.Parser() |
834 | + return parser.parse(spec) |
835 | + |
836 | + def parse_all(self, specs): |
837 | + parser = query_parser.Parser() |
838 | + return parser.parse_all(specs) |
839 | + |
840 | + def assertParseError(self, definition): |
841 | + self.assertRaises(errors.IndexDefinitionParseError, self.parse, |
842 | + definition) |
843 | + |
844 | + def test_parse_empty_string(self): |
845 | + self.assertRaises(errors.IndexDefinitionParseError, self.parse, "") |
846 | + |
847 | + def test_parse_field(self): |
848 | + getter = self.parse("a") |
849 | + self.assertIsInstance(getter, query_parser.ExtractField) |
850 | + self.assertEqual("a", getter.field) |
851 | + |
852 | + def test_parse_dotted_field(self): |
853 | + getter = self.parse("a.b") |
854 | + self.assertIsInstance(getter, query_parser.ExtractField) |
855 | + self.assertEqual("a.b", getter.field) |
856 | + |
857 | + def test_parse_dotted_field_nothing_after_dot(self): |
858 | + self.assertParseError("a.") |
859 | + |
860 | + def test_parse_missing_close_on_transformation(self): |
861 | + self.assertParseError("lower(a") |
862 | + |
863 | + def test_parse_missing_field_in_transformation(self): |
864 | + self.assertParseError("lower()") |
865 | + |
866 | + def test_parse_trailing_chars(self): |
867 | + self.assertParseError("lower(ab$)") |
868 | + |
869 | + def test_parse_empty_op(self): |
870 | + self.assertParseError("(ab)") |
871 | + |
872 | + def test_parse_unknown_op(self): |
873 | + self.assertParseError("no_such_operation(field)") |
874 | + |
875 | + def test_parse_transformation(self): |
876 | + getter = self.parse("lower(a)") |
877 | + self.assertIsInstance(getter, query_parser.Lower) |
878 | + self.assertIsInstance(getter.inner, query_parser.ExtractField) |
879 | + self.assertEqual("a", getter.inner.field) |
880 | + |
881 | + def test_parse_all(self): |
882 | + getters = self.parse_all(["a", "b"]) |
883 | + self.assertEqual(2, len(getters)) |
884 | + self.assertIsInstance(getters[0], query_parser.ExtractField) |
885 | + self.assertEqual("a", getters[0].field) |
886 | + self.assertIsInstance(getters[1], query_parser.ExtractField) |
887 | + self.assertEqual("b", getters[1].field) |
888 | |
889 | === modified file 'u1db/tests/test_sqlite_backend.py' |
890 | --- u1db/tests/test_sqlite_backend.py 2011-12-02 11:02:42 +0000 |
891 | +++ u1db/tests/test_sqlite_backend.py 2011-12-06 15:24:25 +0000 |
892 | @@ -23,6 +23,7 @@ |
893 | from u1db import ( |
894 | errors, |
895 | tests, |
896 | + query_parser, |
897 | ) |
898 | from u1db.backends import sqlite_backend |
899 | |
900 | @@ -115,6 +116,22 @@ |
901 | c.execute("SELECT * FROM conflicts") |
902 | c.execute("SELECT * FROM index_definitions") |
903 | |
904 | + def test__parse_index(self): |
905 | + self.db = sqlite_backend.SQLitePartialExpandDatabase(':memory:') |
906 | + g = self.db._parse_index_definition('fieldname') |
907 | + self.assertIsInstance(g, query_parser.ExtractField) |
908 | + self.assertEqual('fieldname', g.field) |
909 | + |
910 | + def test__update_indexes(self): |
911 | + self.db = sqlite_backend.SQLitePartialExpandDatabase(':memory:') |
912 | + g = self.db._parse_index_definition('fieldname') |
913 | + c = self.db._get_sqlite_handle().cursor() |
914 | + self.db._update_indexes('doc-id', {'fieldname': 'val'}, |
915 | + [('fieldname', g)], c) |
916 | + c.execute('SELECT doc_id, field_name, value FROM document_fields') |
917 | + self.assertEqual([('doc-id', 'fieldname', 'val')], |
918 | + c.fetchall()) |
919 | + |
920 | def test__set_replica_uid(self): |
921 | # Start from scratch, so that replica_uid isn't set. |
922 | self.db = sqlite_backend.SQLitePartialExpandDatabase(':memory:') |
I suppose the parser is written in a style that is easy to translate to C but then I would really use some kind of caching of parser->getter outcomes in SQLitePartialEx pandDatabase. _evaluate_ index. I would expect inserting data otherwise to be (quite?) slower than it was before these changes.
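The caching suggested here could look something like the following minimal sketch. `CachingParser` is a hypothetical wrapper, not code from this branch; it assumes only that the wrapped parser exposes a `parse(definition)` method returning a getter, as `query_parser.Parser` does in the diff above.

```python
# Minimal sketch of memoizing parser->getter outcomes, as suggested in
# the review. CachingParser is a hypothetical wrapper, not code from the
# branch; it assumes only that the inner parser has a parse(definition)
# method that returns a getter object.

class CachingParser(object):
    """Memoize definition-string -> getter results of an inner parser."""

    def __init__(self, parser):
        self._parser = parser
        self._cache = {}

    def parse(self, definition):
        # Each distinct index definition is parsed at most once, so
        # inserting many documents does not re-run the parser per doc.
        if definition not in self._cache:
            self._cache[definition] = self._parser.parse(definition)
        return self._cache[definition]
```

The database object would then hold one such cache for its lifetime, so that the per-document evaluation path never re-parses an index definition.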
69 + if keys is None:
70 return None
that should just be if key: in the new world, the code doesn't break, but we don't bail out early anymore
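For illustration, one way to read that point in the always-a-list model (the `evaluate_keys` helper here is hypothetical, not the branch's code):

```python
# Hypothetical illustration of the review point: getters now always
# return a list, so a "keys is None" check can never fire. Testing
# emptiness restores the early bail-out for documents with no value.

def evaluate_keys(getter, raw_doc):
    keys = getter.get(raw_doc)  # always a list in the new model
    if not keys:                # [] means "no value"; None never occurs now
        return None
    return keys
```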
271 + elif isinstance(raw_doc, list):
272 + # If anything in the list is not a simple type, the list is
273 + result = [val for val in raw_doc
I don't understand the comment there, is it truncated?
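For reference, the behaviour the truncated comment seems to describe matches the mixed-list tests in the diff above: only simple values survive, and nested lists and dicts are dropped. A hypothetical reconstruction of that branch (not the branch's exact code; the branch targets Python 2, type names here are Python 3):

```python
# Hypothetical reconstruction of the list-filtering branch the reviewer
# quotes, based on the TestExtractField mixed-list tests: keep simple
# values (strings and numbers), drop nested lists and dicts.

SIMPLE_TYPES = (str, int, float, bool)  # assumed; Python 3 spellings

def extract_list_values(raw_doc):
    if isinstance(raw_doc, list):
        # Entries that are not simple types are skipped, rather than
        # invalidating the whole list.
        return [val for val in raw_doc if isinstance(val, SIMPLE_TYPES)]
    return raw_doc
```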
is it intentional to check in .testr.conf?