Merge lp:~exarkun/divmod.org/spambayes-fewer-potatoes into lp:divmod.org

Proposed by Jean-Paul Calderone
Status: Merged
Approved by: Tristan Seligmann
Approved revision: 2713
Merged at revision: 2698
Proposed branch: lp:~exarkun/divmod.org/spambayes-fewer-potatoes
Merge into: lp:divmod.org
Diff against target: 327 lines (+222/-14)
3 files modified
Quotient/benchmarks/spambayes (+44/-0)
Quotient/xquotient/spam.py (+91/-14)
Quotient/xquotient/test/test_spambayes.py (+87/-0)
To merge this branch: bzr merge lp:~exarkun/divmod.org/spambayes-fewer-potatoes
Reviewer: Tristan Seligmann (status: Approve)
Review via email: mp+121094@code.launchpad.net

Description of the change

Some speedups to the new spambayes database layer. Benchmark results are on the ticket (sorry, Launchpad, how do you work exactly?)
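The two speedups in this branch are an in-memory read cache consulted during classification and a write cache that is flushed to SQLite in one batch after each trained document. A minimal standalone sketch of the write-batching idea (the `flush_writes` helper name is hypothetical; the `bayes` table shape is taken from the diff):

```python
import sqlite3

def flush_writes(cursor, write_cache):
    # One executemany() round-trip per trained document, instead of one
    # INSERT per token; the cache maps token -> (token, nspam, nham).
    cursor.executemany(
        "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
        "VALUES (?, ?, ?)", list(write_cache.values()))
    write_cache.clear()

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE bayes (word TEXT PRIMARY KEY, nspam INTEGER, nham INTEGER)")
cache = {"spam": ("spam", 3, 0), "puppies": ("puppies", 1, 2)}
flush_writes(db.cursor(), cache)
db.commit()
```

Batching the writes this way moves the per-token cost from a SQL statement each to a dictionary store each, which is where the benchmark numbers come from.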

2713. By Jean-Paul Calderone

Only load 999 word info rows at a time, since that is the maximum number of variables allowed in a SQLite3 statement.
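A standalone sketch of that chunking strategy (the `load_word_info` helper name is hypothetical; 999 is SQLite's default SQLITE_MAX_VARIABLE_NUMBER in builds of that era):

```python
import sqlite3

SQLITE_MAX_VARS = 999  # default SQLITE_MAX_VARIABLE_NUMBER

def load_word_info(cursor, words):
    # Batch the IN (...) placeholders so no single statement binds
    # more than SQLITE_MAX_VARS parameters.
    rows = []
    remaining = list(words)
    while remaining:
        batch = remaining[:SQLITE_MAX_VARS]
        del remaining[:SQLITE_MAX_VARS]
        cursor.execute(
            "SELECT word, nspam, nham FROM bayes WHERE word IN (%s)"
            % ", ".join("?" * len(batch)), batch)
        rows.extend(cursor.fetchall())
    return rows

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE bayes (word TEXT PRIMARY KEY, nspam INTEGER, nham INTEGER)")
db.executemany(
    "INSERT INTO bayes VALUES (?, ?, ?)",
    [("word%d" % i, i, 0) for i in range(1500)])
result = load_word_info(db.cursor(), ["word%d" % i for i in range(1500)])
```

With 1500 tokens this issues two SELECTs (999 + 501 placeholders) rather than failing with "too many SQL variables".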

Revision history for this message
Tristan Seligmann (mithrandi) wrote :

Looking at this, it occurs to me that it would be nice if Axiom had an intermediate SQL-construction layer that is usable without the full ORM. Then again, maybe that's just called "Storm", and it's certainly out of scope for this branch ;)

The code looks reasonably good to me, the extensive docstrings even better; please merge.

review: Approve
Revision history for this message
Glyph Lefkowitz (glyph) wrote :

> Looking at this, it occurs to me that it would be nice if Axiom had an
> intermediate SQL-construction layer that is usable without the full ORM. Then
> again, maybe that's just called "Storm", and it's certainly out of scope for
> this branch ;)

http://trac.calendarserver.org/browser/CalendarServer/trunk/twext/enterprise/dal/syntax.py

Preview Diff

=== added directory 'Quotient/benchmarks'
=== added file 'Quotient/benchmarks/spambayes'
--- Quotient/benchmarks/spambayes 1970-01-01 00:00:00 +0000
+++ Quotient/benchmarks/spambayes 2012-08-25 13:12:19 +0000
@@ -0,0 +1,44 @@
+#!/usr/bin/python
+
+# Benchmark of Quotient spambayes filter, both training and classification.
+
+import sys, tempfile, random, time
+
+from xquotient.spam import _SQLite3Classifier
+
+words = list(open('/usr/share/dict/words', 'r'))
+
+TRAINING_FACTOR = 50
+MESSAGE_FACTOR = 500
+
+def adj(duration):
+    return duration / (TRAINING_FACTOR * MESSAGE_FACTOR) * 1000.0
+
+
+def main(argv):
+    prng = random.Random()
+    prng.seed(12345)
+    prng.shuffle(words)
+
+    classifier = _SQLite3Classifier(tempfile.mktemp())
+
+    before = time.time()
+    for i in range(TRAINING_FACTOR):
+        classifier.learn(words[i:i + MESSAGE_FACTOR], True)
+
+    for i in range(TRAINING_FACTOR, TRAINING_FACTOR * 2):
+        classifier.learn(words[i:i + MESSAGE_FACTOR], False)
+    after = time.time()
+
+    print 'Learning: %.2f ms/word' % (adj(after - before),)
+
+    before = time.time()
+    for i in range(TRAINING_FACTOR * 2):
+        classifier.spamprob(words[i:i + MESSAGE_FACTOR])
+    after = time.time()
+
+    print 'Guessing: %.2f ms/word' % (adj(after - before),)
+
+
+if __name__ == '__main__':
+    main(sys.argv)
=== modified file 'Quotient/xquotient/spam.py'
--- Quotient/xquotient/spam.py 2012-08-20 19:51:00 +0000
+++ Quotient/xquotient/spam.py 2012-08-25 13:12:19 +0000
@@ -382,6 +382,21 @@
     statements. These are executed any time the classifier database is
     opened, with the expected failure which occurs any time the schema has
     already been initialized handled and disregarded.
+
+    @ivar _readCache: Word information that is already known, either because it
+        has already been read from the database once or because we wrote the
+        information to the database. Keys are unicode tokens, values are
+        three-sequences of token, nspam, and nham counts. This is used to hold
+        word info between two different Spambayes hooks, C{_getclues} and
+        C{_wordinfoget}. The former has access to all tokens in a particular
+        document, the latter is a potato-programming mistake. Loading all of
+        the values at once in C{_getclues} is a big performance win.
+
+    @ivar _writeCache: Word information that is on its way to the database due
+        to training. This has the same shape as C{_readCache}. Word info is
+        held here until training on one document is complete, then all the word
+        info is dumped into the database in a single SQL operation (via
+        I{executemany}).
     """

     SCHEMA = [
@@ -413,6 +428,7 @@
         return get, set, None, doc
     nspam = property(*nspam())

+
     def nham():
         doc = """
         A property which reflects the number of messages trained as ham, while
@@ -438,6 +454,9 @@
         """
         classifier.Classifier.__init__(self)
         self.databaseName = databaseName
+        self._readCache = {}
+        self._writeCache = {}
+
         # Open the database, possibly initializing it if it has not yet been
         # initialized, and then load the necessary global state from it (nspam,
         # nham).
@@ -484,6 +503,51 @@
         self.cursor.execute('UPDATE state SET nspam=?, nham=?', (self._nspam, self._nham))


+    def _getclues(self, wordstream):
+        """
+        Hook into the classification process to speed it up.
+
+        See the base implementation for details about what C{_getclues} is
+        supposed to do. This implementation extends the base to look into
+        wordstream and load all the necessary information with the minimum
+        amount of SQLite3 work, then calls up to the base implementation to let
+        it do the actual classification-related work.
+
+        @param wordstream: An iterable (probably a generator) of tokens from the
+            document to be classified.
+        """
+        # Make sure we can consume it and give it to the base implementation for
+        # consumption.
+        wordstream = list(wordstream)
+
+        # Find all the tokens we don't have in memory already
+        missing = []
+        for word in wordstream:
+            if isinstance(word, str):
+                word = word.decode('utf-8', 'replace')
+            if word not in self._readCache:
+                missing.append(word)
+
+        # Load their state
+        while missing:
+            # SQLite3 allows a maximum of 999 variables.
+            load = missing[:999]
+            del missing[:999]
+            self.cursor.execute(
+                "SELECT word, nspam, nham FROM bayes WHERE word IN (%s)" % (
+                    ", ".join("?" * len(load))),
+                load)
+            rows = self.cursor.fetchall()
+
+            # Save them for later
+            for row in rows:
+                self._readCache[row[0]] = row
+
+        # Let the base class do its thing, which will involve asking us about
+        # that state we just cached.
+        return classifier.Classifier._getclues(self, wordstream)
+
+
     def _get(self, word):
         """
         Load the training data for the given word.
@@ -497,13 +561,22 @@
         """
         if isinstance(word, str):
             word = word.decode('utf-8', 'replace')
-
-        self.cursor.execute(
-            "SELECT word, nspam, nham FROM bayes WHERE word=?", (word,))
-        rows = self.cursor.fetchall()
-        if rows:
-            return rows[0]
-        return None
+        try:
+            # Check to see if we already have this word's info in memory.
+            row = self._readCache[word]
+        except KeyError:
+            # If not, load it from the database.
+            self.cursor.execute(
+                "SELECT word, nspam, nham FROM bayes WHERE word=?", (word,))
+            rows = self.cursor.fetchall()
+            if rows:
+                # Add it to the cache and return it.
+                self._readCache[rows[0][0]] = rows[0]
+                return rows[0]
+            return None
+        else:
+            # Otherwise return what we knew already.
+            return row


     def _set(self, word, nspam, nham):
@@ -519,10 +592,7 @@
         """
         if isinstance(word, str):
             word = word.decode('utf-8', 'replace')
-        self.cursor.execute(
-            "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
-            "VALUES (?, ?, ?)",
-            (word, nspam, nham))
+        self._readCache[word] = self._writeCache[word] = (word, nspam, nham)


     def _delete(self, word):
@@ -532,10 +602,12 @@
         @param word: A word (or any other kind of token) to lose training
             information about.
         @type word: C{str} or C{unicode} (but really, C{unicode} please)
+
+        @raise NotImplementedError: Deletion is not actually supported in this
+            backend. Fortunately, Quotient does not need it (it never calls
+            C{unlearn}).
         """
-        if isinstance(word, str):
-            word = word.decode('utf-8', 'replace')
-        self.cursor.execute("DELETE FROM bayes WHERE word=?", (word,))
+        raise NotImplementedError("There is no support for deletion.")


     def _post_training(self):
@@ -545,6 +617,11 @@
         transaction, which contains all of the database modifications for each
         token in that message.
         """
+        writes = self._writeCache.itervalues()
+        self._writeCache = {}
+        self.cursor.executemany(
+            "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
+            "VALUES (?, ?, ?)", writes)
         self.db.commit()


=== modified file 'Quotient/xquotient/test/test_spambayes.py'
--- Quotient/xquotient/test/test_spambayes.py 2012-08-20 19:53:37 +0000
+++ Quotient/xquotient/test/test_spambayes.py 2012-08-25 13:12:19 +0000
@@ -42,6 +42,54 @@
         self.assertEqual(bayes.nspam, 1)


+    def test_spamTokenRecorded(self):
+        """
+        The first time a token is encountered during spam training, a row is
+        inserted into the database counting it as once a spam token, never a ham
+        token.
+        """
+        self.classifier.train(StringIO("spam bad gross"), True)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"spam")
+        self.assertEqual((u"spam", 1, 0), wordInfo)
+
+
+    def test_hamTokenRecorded(self):
+        """
+        The first time a token is encountered during ham training, a row is
+        inserted into the database counting it as never a spam token, once a ham
+        token.
+        """
+        self.classifier.train(StringIO("justice sunshine puppies"), False)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"sunshine")
+        self.assertEqual((u"sunshine", 0, 1), wordInfo)
+
+
+    def test_spamTokenIncremented(self):
+        """
+        Encountered on a subsequent spam training operation, an existing word
+        info row has its spam count incremented and its ham count left alone.
+        """
+        self.classifier.train(StringIO("justice sunshine puppies"), False)
+        self.classifier.train(StringIO("spam bad puppies"), True)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"puppies")
+        self.assertEqual((u"puppies", 1, 1), wordInfo)
+
+
+    def test_hamTokenIncremented(self):
+        """
+        Encountered on a subsequent ham training operation, an existing word
+        info row has its spam count left alone and its ham count incremented.
+        """
+        self.classifier.train(StringIO("spam bad puppies"), True)
+        self.classifier.train(StringIO("justice sunshine puppies"), False)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"puppies")
+        self.assertEqual((u"puppies", 1, 1), wordInfo)
+
+
     def test_nham(self):
         """
         L{SQLite3Classifier} tracks, in memory, the number of ham messages it
@@ -71,6 +119,17 @@
             self.classifier.score(StringIO("spamfulness words of spam")) > 0.99)


+    def test_spamClassificationWithoutCache(self):
+        """
+        Like L{test_spamClassification}, but ensure no instance cache is used
+        to satisfy word info lookups.
+        """
+        self.classifier.train(StringIO("spam words of spamfulness"), True)
+        classifier = Hammie(spam._SQLite3Classifier(self.path), mode='r')
+        self.assertTrue(
+            classifier.score(StringIO("spamfulness words of spam")) > 0.99)
+
+
     def test_hamClassification(self):
         """
         L{SQLite3Classifier} can be trained with a ham message so as to later
@@ -81,6 +140,34 @@
             self.classifier.score(StringIO("words, very nice")) < 0.01)


+    def test_hamClassificationWithoutCache(self):
+        """
+        Like L{test_hamClassification}, but ensure no instance cache is used
+        to satisfy word info lookups.
+        """
+        self.classifier.train(StringIO("very nice words"), False)
+        classifier = Hammie(spam._SQLite3Classifier(self.path), mode='r')
+        self.assertTrue(
+            classifier.score(StringIO("words, very nice")) < 0.01)
+
+
+    def test_largeDocumentClassification(self):
+        """
+        A document with more than 999 tokens can be successfully classified.
+        """
+        words = []
+        for i in range(1000):
+            word = "word%d" % (i,)
+            words.append(word)
+        document = " ".join(words)
+        self.classifier.train(StringIO(document), False)
+
+        classifier = Hammie(spam._SQLite3Classifier(self.path), mode='r')
+        self.assertTrue(
+            classifier.score(StringIO(document)) < 0.01)
+
+

 class SpambayesFilterTestCase(unittest.TestCase, MessageCreationMixin):
     """
