Merge lp:~exarkun/divmod.org/spambayes-fewer-potatoes into lp:divmod.org

Proposed by Jean-Paul Calderone
Status: Merged
Approved by: Tristan Seligmann
Approved revision: 2713
Merged at revision: 2698
Proposed branch: lp:~exarkun/divmod.org/spambayes-fewer-potatoes
Merge into: lp:divmod.org
Diff against target: 327 lines (+222/-14)
3 files modified
Quotient/benchmarks/spambayes (+44/-0)
Quotient/xquotient/spam.py (+91/-14)
Quotient/xquotient/test/test_spambayes.py (+87/-0)
To merge this branch: bzr merge lp:~exarkun/divmod.org/spambayes-fewer-potatoes
Reviewer: Tristan Seligmann (Approve)
Review via email: mp+121094@code.launchpad.net

Description of the change

Some speedups to the new spambayes database layer. Benchmark results are on the ticket (sorry, Launchpad, how do you work exactly?).
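The two main techniques in the branch are batching reads (one IN query per document instead of one query per token) and batching writes (one executemany per training operation instead of one INSERT per token). As a minimal illustration of the write side (the flush_writes helper is hypothetical, not code from the branch):

    # Sketch of write batching: accumulate per-token updates in a dict,
    # then flush them to SQLite3 in one executemany call instead of
    # issuing one INSERT OR REPLACE per token.
    def flush_writes(cursor, write_cache):
        writes = write_cache.values()
        write_cache.clear()
        cursor.executemany(
            "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
            "VALUES (?, ?, ?)", writes)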

2713. By Jean-Paul Calderone

Only load 999 word info rows at a time, since that is the maximum number of variables allowed in a SQLite3 statement.
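That chunking pattern looks roughly like this (a minimal standalone sketch, assuming a DB-API cursor and the bayes table from the diff below; lookup_words is a hypothetical helper, not code from this branch):

    # Minimal sketch of chunked IN-clause lookups under SQLite3's default
    # limit of 999 bound variables per statement (SQLITE_MAX_VARIABLE_NUMBER).
    def lookup_words(cursor, words, chunk=999):
        rows = []
        for start in range(0, len(words), chunk):
            load = words[start:start + chunk]
            # One "?" placeholder per word in this chunk.
            placeholders = ", ".join("?" * len(load))
            cursor.execute(
                "SELECT word, nspam, nham FROM bayes WHERE word IN (%s)"
                % (placeholders,), load)
            rows.extend(cursor.fetchall())
        return rows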

Revision history for this message
Tristan Seligmann (mithrandi) wrote:

Looking at this, it occurs to me that it would be nice if Axiom had an intermediate SQL-construction layer that is usable without the full ORM. Then again, maybe that's just called "Storm", and it's certainly out of scope for this branch ;)

The code looks reasonably good to me, the extensive docstrings even better; please merge.

review: Approve
Revision history for this message
Glyph Lefkowitz (glyph) wrote:

> Looking at this, it occurs to me that it would be nice if Axiom had an
> intermediate SQL-construction layer that is usable without the full ORM. Then
> again, maybe that's just called "Storm", and it's certainly out of scope for
> this branch ;)

http://trac.calendarserver.org/browser/CalendarServer/trunk/twext/enterprise/dal/syntax.py
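For a sense of what such an intermediate layer does, here is a toy sketch: statements are modelled as objects and compiled to a SQL string plus a bound-parameter list, with no mapping of rows to objects. The Select class and compile method below are hypothetical names, not the API of Storm, Axiom, or the twext DAL linked above.

    # Hypothetical sketch of an SQL-construction layer without an ORM:
    # build the statement and its parameter list together, so quoting and
    # placeholder counting are never done by hand.
    class Select(object):
        def __init__(self, columns, table, where=()):
            self.columns = columns
            self.table = table
            self.where = where

        def compile(self):
            sql = "SELECT %s FROM %s" % (", ".join(self.columns), self.table)
            params = []
            if self.where:
                clauses = []
                for column, value in self.where:
                    clauses.append("%s = ?" % (column,))
                    params.append(value)
                sql += " WHERE " + " AND ".join(clauses)
            return sql, params

    # Usage:
    #   sql, params = Select(["word", "nspam"], "bayes",
    #                        [("word", u"spam")]).compile()
    #   cursor.execute(sql, params)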

Preview Diff

=== added directory 'Quotient/benchmarks'
=== added file 'Quotient/benchmarks/spambayes'
--- Quotient/benchmarks/spambayes 1970-01-01 00:00:00 +0000
+++ Quotient/benchmarks/spambayes 2012-08-25 13:12:19 +0000
@@ -0,0 +1,44 @@
+#!/usr/bin/python
+
+# Benchmark of Quotient spambayes filter, both training and classification.
+
+import sys, tempfile, random, time
+
+from xquotient.spam import _SQLite3Classifier
+
+words = list(open('/usr/share/dict/words', 'r'))
+
+TRAINING_FACTOR = 50
+MESSAGE_FACTOR = 500
+
+def adj(duration):
+    return duration / (TRAINING_FACTOR * MESSAGE_FACTOR) * 1000.0
+
+
+def main(argv):
+    prng = random.Random()
+    prng.seed(12345)
+    prng.shuffle(words)
+
+    classifier = _SQLite3Classifier(tempfile.mktemp())
+
+    before = time.time()
+    for i in range(TRAINING_FACTOR):
+        classifier.learn(words[i:i + MESSAGE_FACTOR], True)
+
+    for i in range(TRAINING_FACTOR, TRAINING_FACTOR * 2):
+        classifier.learn(words[i:i + MESSAGE_FACTOR], False)
+    after = time.time()
+
+    print 'Learning: %.2f ms/word' % (adj(after - before),)
+
+    before = time.time()
+    for i in range(TRAINING_FACTOR * 2):
+        classifier.spamprob(words[i:i + MESSAGE_FACTOR])
+    after = time.time()
+
+    print 'Guessing: %.2f ms/word' % (adj(after - before),)
+
+
+if __name__ == '__main__':
+    main(sys.argv)

=== modified file 'Quotient/xquotient/spam.py'
--- Quotient/xquotient/spam.py 2012-08-20 19:51:00 +0000
+++ Quotient/xquotient/spam.py 2012-08-25 13:12:19 +0000
@@ -382,6 +382,21 @@
     statements. These are executed any time the classifier database is
     opened, with the expected failure which occurs any time the schema has
     already been initialized handled and disregarded.
+
+    @ivar _readCache: Word information that is already known, either because it
+        has already been read from the database once or because we wrote the
+        information to the database. Keys are unicode tokens, values are
+        three-sequences of token, nspam, and nham counts. This is used to hold
+        word info between two different Spambayes hooks, C{_getclues} and
+        C{_wordinfoget}. The former has access to all tokens in a particular
+        document, the latter is a potato-programming mistake. Loading all of
+        the values at once in C{_getclues} is a big performance win.
+
+    @ivar _writeCache: Word information that is on its way to the database due
+        to training. This has the same shape as C{_readCache}. Word info is
+        held here until training on one document is complete, then all the word
+        info is dumped into the database in a single SQL operation (via
+        I{executemany}).
     """
 
     SCHEMA = [
@@ -413,6 +428,7 @@
         return get, set, None, doc
     nspam = property(*nspam())
 
+
     def nham():
         doc = """
         A property which reflects the number of messages trained as ham, while
@@ -438,6 +454,9 @@
         """
         classifier.Classifier.__init__(self)
         self.databaseName = databaseName
+        self._readCache = {}
+        self._writeCache = {}
+
         # Open the database, possibly initializing it if it has not yet been
         # initialized, and then load the necessary global state from it (nspam,
         # nham).
@@ -484,6 +503,51 @@
         self.cursor.execute('UPDATE state SET nspam=?, nham=?', (self._nspam, self._nham))
 
 
+    def _getclues(self, wordstream):
+        """
+        Hook into the classification process to speed it up.
+
+        See the base implementation for details about what C{_getclues} is
+        supposed to do. This implementation extends the base to look into
+        wordstream and load all the necessary information with the minimum
+        amount of SQLite3 work, then calls up to the base implementation to let
+        it do the actual classification-related work.
+
+        @param wordstream: An iterable (probably a generator) of tokens from the
+            document to be classified.
+        """
+        # Make sure we can consume it and give it to the base implementation for
+        # consumption.
+        wordstream = list(wordstream)
+
+        # Find all the tokens we don't have in memory already
+        missing = []
+        for word in wordstream:
+            if isinstance(word, str):
+                word = word.decode('utf-8', 'replace')
+            if word not in self._readCache:
+                missing.append(word)
+
+        # Load their state
+        while missing:
+            # SQLite3 allows a maximum of 999 variables.
+            load = missing[:999]
+            del missing[:999]
+            self.cursor.execute(
+                "SELECT word, nspam, nham FROM bayes WHERE word IN (%s)" % (
+                    ", ".join("?" * len(load))),
+                load)
+            rows = self.cursor.fetchall()
+
+            # Save them for later
+            for row in rows:
+                self._readCache[row[0]] = row
+
+        # Let the base class do its thing, which will involve asking us about
+        # that state we just cached.
+        return classifier.Classifier._getclues(self, wordstream)
+
+
     def _get(self, word):
         """
         Load the training data for the given word.
@@ -497,13 +561,22 @@
         """
         if isinstance(word, str):
             word = word.decode('utf-8', 'replace')
-
-        self.cursor.execute(
-            "SELECT word, nspam, nham FROM bayes WHERE word=?", (word,))
-        rows = self.cursor.fetchall()
-        if rows:
-            return rows[0]
-        return None
+        try:
+            # Check to see if we already have this word's info in memory.
+            row = self._readCache[word]
+        except KeyError:
+            # If not, load it from the database.
+            self.cursor.execute(
+                "SELECT word, nspam, nham FROM bayes WHERE word=?", (word,))
+            rows = self.cursor.fetchall()
+            if rows:
+                # Add it to the cache and return it.
+                self._readCache[rows[0][0]] = rows[0]
+                return rows[0]
+            return None
+        else:
+            # Otherwise return what we knew already.
+            return row
 
 
     def _set(self, word, nspam, nham):
@@ -519,10 +592,7 @@
         """
         if isinstance(word, str):
             word = word.decode('utf-8', 'replace')
-        self.cursor.execute(
-            "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
-            "VALUES (?, ?, ?)",
-            (word, nspam, nham))
+        self._readCache[word] = self._writeCache[word] = (word, nspam, nham)
 
 
     def _delete(self, word):
@@ -532,10 +602,12 @@
         @param word: A word (or any other kind of token) to lose training
             information about.
         @type word: C{str} or C{unicode} (but really, C{unicode} please)
+
+        @raise NotImplementedError: Deletion is not actually supported in this
+            backend. Fortunately, Quotient does not need it (it never calls
+            C{unlearn}).
         """
-        if isinstance(word, str):
-            word = word.decode('utf-8', 'replace')
-        self.cursor.execute("DELETE FROM bayes WHERE word=?", (word,))
+        raise NotImplementedError("There is no support for deletion.")
 
 
     def _post_training(self):
@@ -545,6 +617,11 @@
         transaction, which contains all of the database modifications for each
         token in that message.
         """
+        writes = self._writeCache.itervalues()
+        self._writeCache = {}
+        self.cursor.executemany(
+            "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
+            "VALUES (?, ?, ?)", writes)
         self.db.commit()
 


=== modified file 'Quotient/xquotient/test/test_spambayes.py'
--- Quotient/xquotient/test/test_spambayes.py 2012-08-20 19:53:37 +0000
+++ Quotient/xquotient/test/test_spambayes.py 2012-08-25 13:12:19 +0000
@@ -42,6 +42,54 @@
         self.assertEqual(bayes.nspam, 1)
 
 
+    def test_spamTokenRecorded(self):
+        """
+        The first time a token is encountered during spam training, a row is
+        inserted into the database counting it as once a spam token, never a ham
+        token.
+        """
+        self.classifier.train(StringIO("spam bad gross"), True)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"spam")
+        self.assertEqual((u"spam", 1, 0), wordInfo)
+
+
+    def test_hamTokenRecorded(self):
+        """
+        The first time a token is encountered during ham training, a row is
+        inserted into the database counting it as never a spam token, once a ham
+        token.
+        """
+        self.classifier.train(StringIO("justice sunshine puppies"), False)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"sunshine")
+        self.assertEqual((u"sunshine", 0, 1), wordInfo)
+
+
+    def test_spamTokenIncremented(self):
+        """
+        Encountered on a subsequent spam training operation, an existing word
+        info row has its spam count incremented and its ham count left alone.
+        """
+        self.classifier.train(StringIO("justice sunshine puppies"), False)
+        self.classifier.train(StringIO("spam bad puppies"), True)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"puppies")
+        self.assertEqual((u"puppies", 1, 1), wordInfo)
+
+
+    def test_hamTokenIncremented(self):
+        """
+        Encountered on a subsequent ham training operation, an existing word
+        info row has its spam count left alone and its ham count incremented.
+        """
+        self.classifier.train(StringIO("spam bad puppies"), True)
+        self.classifier.train(StringIO("justice sunshine puppies"), False)
+        bayes = spam._SQLite3Classifier(self.path)
+        wordInfo = bayes._get(u"puppies")
+        self.assertEqual((u"puppies", 1, 1), wordInfo)
+
+
     def test_nham(self):
         """
         L{SQLite3Classifier} tracks, in memory, the number of ham messages it
@@ -71,6 +119,17 @@
             self.classifier.score(StringIO("spamfulness words of spam")) > 0.99)
 
 
+    def test_spamClassificationWithoutCache(self):
+        """
+        Like L{test_spamClassification}, but ensure no instance cache is used
+        to satisfy word info lookups.
+        """
+        self.classifier.train(StringIO("spam words of spamfulness"), True)
+        classifier = Hammie(spam._SQLite3Classifier(self.path), mode='r')
+        self.assertTrue(
+            classifier.score(StringIO("spamfulness words of spam")) > 0.99)
+
+
     def test_hamClassification(self):
         """
         L{SQLite3Classifier} can be trained with a ham message so as to later
@@ -81,6 +140,34 @@
             self.classifier.score(StringIO("words, very nice")) < 0.01)
 
 
+    def test_hamClassificationWithoutCache(self):
+        """
+        Like L{test_hamClassification}, but ensure no instance cache is used
+        to satisfy word info lookups.
+        """
+        self.classifier.train(StringIO("very nice words"), False)
+        classifier = Hammie(spam._SQLite3Classifier(self.path), mode='r')
+        self.assertTrue(
+            classifier.score(StringIO("words, very nice")) < 0.01)
+
+
+
+    def test_largeDocumentClassification(self):
+        """
+        A document with more than 999 tokens can be successfully classified.
+        """
+        words = []
+        for i in range(1000):
+            word = "word%d" % (i,)
+            words.append(word)
+        document = " ".join(words)
+        self.classifier.train(StringIO(document), False)
+
+        classifier = Hammie(spam._SQLite3Classifier(self.path), mode='r')
+        self.assertTrue(
+            classifier.score(StringIO(document)) < 0.01)
+
+
 
 class SpambayesFilterTestCase(unittest.TestCase, MessageCreationMixin):
     """
