Merge lp:~exarkun/divmod.org/database-classifier-persistence into lp:divmod.org

Proposed by Jean-Paul Calderone
Status: Merged
Merged at revision: 2697
Proposed branch: lp:~exarkun/divmod.org/database-classifier-persistence
Merge into: lp:divmod.org
Diff against target: 605 lines (+450/-43)
6 files modified
Quotient/README.txt (+1/-1)
Quotient/xquotient/spam.py (+266/-26)
Quotient/xquotient/test/historic/stub_spambayesfilter2to3.py (+36/-0)
Quotient/xquotient/test/historic/test_spambayesfilter1to2.py (+1/-1)
Quotient/xquotient/test/historic/test_spambayesfilter2to3.py (+37/-0)
Quotient/xquotient/test/test_spambayes.py (+109/-15)
To merge this branch: bzr merge lp:~exarkun/divmod.org/database-classifier-persistence
Reviewer: Tristan Seligmann (Needs Fixing)
Review via email: mp+120231@code.launchpad.net

Description of the change

Change the persistence of spambayes classifier data to use SQLite3 instead of pickle, to improve performance with large training sets (and probably to provide other benefits as well, such as reliability, introspectability, and maintainability).
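The idea can be sketched in a few lines (a minimal illustration using Python's sqlite3 module, not the branch's code): per-token training counts live in a SQLite3 table instead of one big pickled Classifier object, so lookups and updates touch only the rows involved. The table layout mirrors the SCHEMA in the diff.

```python
import sqlite3

# Minimal sketch of the persistence approach: a word -> (nspam, nham)
# table replaces a monolithic pickle of the whole classifier.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE bayes ("
    " word TEXT NOT NULL DEFAULT '',"
    " nspam INTEGER NOT NULL DEFAULT 0,"
    " nham INTEGER NOT NULL DEFAULT 0,"
    " PRIMARY KEY (word))")

def record(word, nspam, nham):
    # INSERT OR REPLACE is how the branch upserts a token's counts.
    db.execute(
        "INSERT OR REPLACE INTO bayes (word, nspam, nham) VALUES (?, ?, ?)",
        (word, nspam, nham))

def lookup(word):
    # Returns (nspam, nham), or None if the token was never trained.
    return db.execute(
        "SELECT nspam, nham FROM bayes WHERE word = ?", (word,)).fetchone()

record(u"viagra", 10, 0)
```

Only the rows for the tokens in a message need to be read or written per training or classification pass, which is what helps with large training sets.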

2706. By Jean-Paul Calderone

Spambayes dependency went up

Revision history for this message
Tristan Seligmann (mithrandi) wrote :

Insofar as I understand anything that is happening in this code, the changes look reasonable. I have a few minor points:

1. xquotient.spam._SQLite3Classifier

1.1. SCHEMA is not documented in the class docstring.

1.2. The way the nspam and nham properties are defined loses the docstring. This isn't terribly important, perhaps, but it would be nice to have it available for runtime introspection.
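For reference, one way to keep such a docstring available for runtime introspection (a hypothetical sketch, not this branch's code) is to pass it as the fourth argument to property():

```python
# Hypothetical sketch: supplying the doc argument to property() keeps
# the docstring reachable via __doc__ and help() at runtime.
class Counter(object):
    def _get_nspam(self):
        return self._nspam

    def _set_nspam(self, value):
        self._nspam = value

    nspam = property(
        _get_nspam, _set_nspam, None,
        "The number of messages trained as spam.")
```

Any pattern that ends up calling property() with a doc string (including a helper function returning the getter, setter, deleter, and doc) addresses the point.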

2. xquotient.test.test_spambayes.SpambayesFilterTestCase has a handful of "XXX" comments; these should probably be turned into bug report(s).

review: Needs Fixing
2707. By Jean-Paul Calderone

Document SCHEMA and make the nspam and nham documentation more accessible.

2708. By Jean-Paul Calderone

Bug report reference for poor test coverage

Preview Diff

1=== modified file 'Quotient/README.txt'
2--- Quotient/README.txt 2009-07-06 12:50:36 +0000
3+++ Quotient/README.txt 2012-08-20 19:55:33 +0000
4@@ -15,7 +15,7 @@
5 Dependencies
6 ------------
7
8-Quotient is an offering for Divmod Mantissa. It depends on SpamBayes 1.0 and
9+Quotient is an offering for Divmod Mantissa. It depends on SpamBayes 1.1a6 and
10 Divmod Mantissa. See DEPS.txt in the Mantissa release for more information on
11 Mantissa dependencies.
12
13
14=== modified file 'Quotient/xquotient/spam.py'
15--- Quotient/xquotient/spam.py 2012-07-26 15:59:07 +0000
16+++ Quotient/xquotient/spam.py 2012-08-20 19:55:33 +0000
17@@ -2,6 +2,7 @@
18
19 import cPickle, errno
20 from decimal import Decimal
21+import sqlite3
22
23 from spambayes import hammie, classifier
24
25@@ -362,38 +363,253 @@
26 return df
27 registerUpgrader(_dspamFilter1to2, DSPAMFilter.typeName, 1, 2)
28
29+class _SQLite3Classifier(object, classifier.Classifier):
30+ """
31+ A Spambayes classifier which implements training dataset persistence in a
32+ SQLite3 database.
33+
34+ The dataset is persisted in a dedicated SQLite3 database rather than in the
35+ user store because the data is accessed from the batch process. A dedicated
36+ database will lead to fewer contention problems than re-using the user
37+ store.
38+
39+ Axiom is not used for the training dataset database because of the
40+ complicated, inflexible hooks supplied by the base
41+ L{spambayes.classifier.Classifier}, which in particular make getting
42+ transaction management difficult with Axiom.
43+
44+ @cvar SCHEMA: A C{list} of C{str} giving the schema initialization
45+ statements. These are executed any time the classifier database is
46+ opened, with the expected failure which occurs any time the schema has
47+ already been initialized handled and disregarded.
48+ """
49+
50+ SCHEMA = [
51+ "CREATE TABLE bayes ("
52+ " word TEXT NOT NULL DEFAULT '',"
53+ " nspam INTEGER NOT NULL DEFAULT 0,"
54+ " nham INTEGER NOT NULL DEFAULT 0,"
55+ " PRIMARY KEY(word)"
56+ ")",
57+ "CREATE INDEX bayes_word ON bayes(word)",
58+ "CREATE TABLE state ("
59+ " nspam INTEGER NOT NULL,"
60+ " nham INTEGER NOT NULL"
61+ ")",
62+ "INSERT INTO state (nspam, nham) VALUES (0, 0)",
63+ ]
64+
65+ def nspam():
66+ doc = """
67+ A property which reflects the number of messages trained as spam, while
68+ also automatically persisting any changes to this value (which the base
69+ class will make) to the database.
70+ """
71+ def get(self):
72+ return self._nspam
73+ def set(self, value):
74+ self._nspam = value
75+ self._recordState()
76+ return get, set, None, doc
77+ nspam = property(*nspam())
78+
79+ def nham():
80+ doc = """
81+ A property which reflects the number of messages trained as ham, while
82+ also automatically persisting any changes to this value (which the base
83+ class will make) to the database.
84+ """
85+ def get(self):
86+ return self._nham
87+ def set(self, value):
88+ self._nham = value
89+ self._recordState()
90+ return get, set, None, doc
91+ nham = property(*nham())
92+
93+ db = cursor = databaseName = None
94+
95+ def __init__(self, databaseName):
96+ """
97+ Initialize this classifier.
98+
99+ @param databaseName: The path to the SQLite3 database from which to load
100+ and to which to store training data.
101+ """
102+ classifier.Classifier.__init__(self)
103+ self.databaseName = databaseName
104+ # Open the database, possibly initializing it if it has not yet been
105+ # initialized, and then load the necessary global state from it (nspam,
106+ # nham).
107+ self.load()
108+
109+
110+ def load(self):
111+ """
112+ Open the training dataset database, initializing it if necessary, and
113+ loading necessary initial state from it.
114+ """
115+ self.db = sqlite3.connect(self.databaseName, isolation_level='IMMEDIATE')
116+ self.cursor = self.db.cursor()
117+ try:
118+ for statement in self.SCHEMA:
119+ self.cursor.execute(statement)
120+ except sqlite3.OperationalError as e:
121+ # Table already exists
122+ self.db.rollback()
123+ else:
124+ self.db.commit()
125+
126+ self.cursor.execute('SELECT nspam, nham FROM state')
127+ rows = self.cursor.fetchall()
128+ self._nspam, self._nham = rows[0]
129+
130+
131+ def close(self):
132+ """
133+ Close the training dataset database.
134+ """
135+ self.cursor.close()
136+ self.db.close()
137+ self.db = self.cursor = None
138+
139+
140+ def _recordState(self):
141+ """
142+ Save nspam and nham, if the database has been opened (it will not have
143+ been opened when this gets run from the nspam/nham property setters
144+ invoked from Classifier.__init__, and that's just fine with us).
145+ """
146+ if self.cursor is not None:
147+ self.cursor.execute('UPDATE state SET nspam=?, nham=?', (self._nspam, self._nham))
148+
149+
150+ def _get(self, word):
151+ """
152+ Load the training data for the given word.
153+
154+ @param word: A word (or any other kind of token) to load information about.
155+ @type word: C{str} or C{unicode} (but really, C{unicode} please)
156+
157+ @return: C{None} if there is no training data for the given word,
158+ otherwise a two-sequence of the number of times the token has been
159+ trained as spam and the number of times it has been trained as ham.
160+ """
161+ if isinstance(word, str):
162+ word = word.decode('utf-8', 'replace')
163+
164+ self.cursor.execute(
165+ "SELECT word, nspam, nham FROM bayes WHERE word=?", (word,))
166+ rows = self.cursor.fetchall()
167+ if rows:
168+ return rows[0]
169+ return None
170+
171+
172+ def _set(self, word, nspam, nham):
173+ """
174+ Update the training data for a particular word.
175+
176+ @param word: A word (or any other kind of token) to store training
177+ information about.
178+ @type word: C{str} or C{unicode} (but really, C{unicode} please)
179+
180+ @param nspam: The number of times the token has been trained as spam.
181+ @param nham: The number of times the token has been trained as ham.
182+ """
183+ if isinstance(word, str):
184+ word = word.decode('utf-8', 'replace')
185+ self.cursor.execute(
186+ "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
187+ "VALUES (?, ?, ?)",
188+ (word, nspam, nham))
189+
190+
191+ def _delete(self, word):
192+ """
193+ Forget the training data for a particular word.
194+
195+ @param word: A word (or any other kind of token) to lose training
196+ information about.
197+ @type word: C{str} or C{unicode} (but really, C{unicode} please)
198+ """
199+ if isinstance(word, str):
200+ word = word.decode('utf-8', 'replace')
201+ self.cursor.execute("DELETE FROM bayes WHERE word=?", (word,))
202+
203+
204+ def _post_training(self):
205+ """
206+ L{Classifier} hook invoked after all other steps of training a message
207+ have been completed. This is used to commit the currently active
208+ transaction, which contains all of the database modifications for each
209+ token in that message.
210+ """
211+ self.db.commit()
212+
213+
214+ def _bulkwordinfoset(self, words):
215+ """
216+ Upgrade helper for recording spam and ham information for many words at
217+ once.
218+ """
219+ self.cursor.executemany(
220+ "INSERT OR REPLACE INTO bayes (word, nspam, nham) "
221+ "VALUES (?, ?, ?)",
222+ ((word.decode('utf-8', 'replace'), record.spamcount, record.hamcount)
223+ for (word, record)
224+ in words))
225+
226+
227+ def _wordinfoset(self, word, record):
228+ """
229+ L{Classifier} hook invoked to record a changed spam and ham data for a
230+ single word.
231+ """
232+ self._set(word, record.spamcount, record.hamcount)
233+
234+
235+ def _wordinfoget(self, word):
236+ """
237+ L{Classifier} hook invoked to retrieve persisted spam and ham data for a
238+ single word.
239+ """
240+ row = self._get(word)
241+ if row:
242+ item = self.WordInfoClass()
243+ item.__setstate__((row[1], row[2]))
244+ return item
245+ return None
246+
247+
248+ def _wordinfodel(self, word):
249+ """
250+ L{Classifier} hook invoked to discard the persisted spam and ham data
251+ for a single word.
252+ """
253+ self._delete(word)
254+
255+
256+
257 class SpambayesFilter(item.Item):
258 """
259 Spambayes-based L{iquotient.IHamFilter} powerup.
260 """
261 implements(iquotient.IHamFilter)
262- schemaVersion = 2
263+ schemaVersion = 3
264 classifier = attributes.inmemory()
265 guesser = attributes.inmemory()
266 filter = dependsOn(Filter)
267
268 powerupInterfaces = (iquotient.IHamFilter,)
269
270-
271-
272 def _classifierPath(self):
273- return self.store.newFilePath('spambayes-%d-classifier.pickle' % (self.storeID,))
274+ return self.store.newFilePath('spambayes-%d-classifier.sqlite' % (self.storeID,))
275
276
277 def activate(self):
278- try:
279- try:
280- c = cPickle.load(self._classifierPath().open())
281- except IOError, e:
282- if e.errno != errno.ENOENT:
283- raise
284- c = classifier.Classifier()
285- except:
286- log.msg("Loading Spambayes trained state failed:")
287- log.err()
288- c = classifier.Classifier()
289- self.classifier = c
290- self.guesser = hammie.Hammie(c, mode='r')
291+ self.classifier = _SQLite3Classifier(self._classifierPath().path)
292+ self.guesser = hammie.Hammie(self.classifier, mode='r')
293
294
295 # IHamFilter
296@@ -418,21 +634,13 @@
297 else:
298 if self.classify(item) > SPAM_THRESHHOLD:
299 break
300- self.classify(item)
301- p = self._classifierPath()
302- if not p.parent().exists():
303- p.parent().makedirs()
304- t = p.temporarySibling()
305- cPickle.dump(self.classifier, t.open())
306- t.moveTo(p)
307
308
309 def forgetTraining(self):
310 p = self._classifierPath()
311 if p.exists():
312 p.remove()
313- self.classifier = classifier.Classifier()
314- self.guesser = hammie.Hammie(self.classifier, mode='r')
315+ self.activate()
316
317
318 item.declareLegacyItem(SpambayesFilter.typeName, 1, dict(
319@@ -444,6 +652,38 @@
320 return sbf
321 registerUpgrader(_sbFilter1to2, SpambayesFilter.typeName, 1, 2)
322
323+item.declareLegacyItem(SpambayesFilter.typeName, 2, dict(
324+ filter=attributes.reference()))
325+
326+
327+def _sbFilter2to3(old):
328+ """
329+ Convert the pickled spambayes data to a SQLite3 database of the same data.
330+ """
331+ sbf = old.upgradeVersion(
332+ SpambayesFilter.typeName, 2, 3, filter=old.filter)
333+ path = sbf.store.newFilePath('spambayes-%d-classifier.pickle' % (sbf.storeID,))
334+ try:
335+ fObj = path.open()
336+ except IOError, e:
337+ if e.errno != errno.ENOENT:
338+ raise
339+ else:
340+ return sbf
341+ try:
342+ dataset = cPickle.load(fObj)
343+ finally:
344+ fObj.close()
345+
346+ words = dataset._wordinfokeys()
347+ sbf.classifier._bulkwordinfoset(
348+ ((word, dataset._wordinfoget(word)) for word in words))
349+ sbf.classifier.nspam = dataset.nspam
350+ sbf.classifier.nham = dataset.nham
351+ sbf.classifier._post_training()
352+ return sbf
353+registerUpgrader(_sbFilter2to3, SpambayesFilter.typeName, 2, 3)
354+
355
356 class SpambayesBenefactor(item.Item):
357 endowed = attributes.integer(default=0)
358
359=== added file 'Quotient/xquotient/test/historic/spambayesfilter2to3.axiom.tbz2'
360Binary files Quotient/xquotient/test/historic/spambayesfilter2to3.axiom.tbz2 1970-01-01 00:00:00 +0000 and Quotient/xquotient/test/historic/spambayesfilter2to3.axiom.tbz2 2012-08-20 19:55:33 +0000 differ
361=== added file 'Quotient/xquotient/test/historic/stub_spambayesfilter2to3.py'
362--- Quotient/xquotient/test/historic/stub_spambayesfilter2to3.py 1970-01-01 00:00:00 +0000
363+++ Quotient/xquotient/test/historic/stub_spambayesfilter2to3.py 2012-08-20 19:55:33 +0000
364@@ -0,0 +1,36 @@
365+from StringIO import StringIO
366+
367+from axiom.test.historic.stubloader import saveStub
368+from axiom.userbase import LoginMethod
369+from axiom.dependency import installOn
370+
371+from xquotient.spam import SpambayesFilter
372+
373+SPAM_A = ["spam", "bad"]
374+SPAM_B = ["junk", "evil"]
375+HAM = ["ham", "good", "wonderful", "justice"]
376+AMBIGUOUS = "ambiguous"
377+
378+
379+class Document(object):
380+ def __init__(self, text):
381+ self.impl = self
382+ self.source = self
383+ self.text = text
384+
385+
386+ def open(self):
387+ return StringIO(self.text)
388+
389+
390+
391+def createDatabase(s):
392+ bayes = SpambayesFilter(store=s)
393+ installOn(bayes, s)
394+ bayes.train(True, Document(" ".join(SPAM_A)))
395+ bayes.train(True, Document(" ".join(SPAM_B + [AMBIGUOUS])))
396+ bayes.train(False, Document(" ".join(HAM + [AMBIGUOUS])))
397+
398+
399+if __name__ == '__main__':
400+ saveStub(createDatabase, 'exarkun@twistedmatrix.com-20120811233233-fxrt1q924l6wmgxp')
401
402=== modified file 'Quotient/xquotient/test/historic/test_spambayesfilter1to2.py'
403--- Quotient/xquotient/test/historic/test_spambayesfilter1to2.py 2007-01-23 04:03:52 +0000
404+++ Quotient/xquotient/test/historic/test_spambayesfilter1to2.py 2012-08-20 19:55:33 +0000
405@@ -3,7 +3,7 @@
406
407
408 class SpambayesFilterTestCase(stubloader.StubbedTest):
409- def testUpgrade(self):
410+ def test_upgradedFields(self):
411 """
412 Ensure upgraded fields refer to correct items.
413 """
414
415=== added file 'Quotient/xquotient/test/historic/test_spambayesfilter2to3.py'
416--- Quotient/xquotient/test/historic/test_spambayesfilter2to3.py 1970-01-01 00:00:00 +0000
417+++ Quotient/xquotient/test/historic/test_spambayesfilter2to3.py 2012-08-20 19:55:33 +0000
418@@ -0,0 +1,37 @@
419+import sqlite3
420+from axiom.test.historic import stubloader
421+
422+from xquotient.spam import SpambayesFilter
423+from xquotient.test.historic.stub_spambayesfilter2to3 import (
424+ SPAM_A, SPAM_B, HAM, AMBIGUOUS)
425+from xquotient.test.historic.test_spambayesfilter1to2 import SpambayesFilterTestCase
426+
427+class SpambayesFilterTestCase(SpambayesFilterTestCase):
428+ def test_upgradedTrainingDataset(self):
429+ """
430+ The upgrade converts the pickled training set into a SQLite3 training
431+ set. All of the training data is retained.
432+ """
433+ bayes = self.store.findUnique(SpambayesFilter)
434+ db = sqlite3.connect(bayes._classifierPath().path)
435+ cursor = db.cursor()
436+ self.addCleanup(db.close)
437+ self.addCleanup(cursor.close)
438+
439+ cursor.execute('SELECT word, nham, nspam FROM bayes')
440+ expected = set(
441+ [(word, 0, 10) for word in SPAM_A] +
442+ [(word, 0, 10) for word in SPAM_B] +
443+ [(word, 1, 0) for word in HAM] +
444+ [(AMBIGUOUS, 1, 10)])
445+ found = set(cursor.fetchall())
446+
447+ # There may be extra tokens due to funny extra spambayes logic. That's
448+ # fine. As long as we see all the tokens we put there, the upgrade is a
449+ # success.
450+ self.assertEqual(expected, found & expected)
451+
452+ cursor.execute('SELECT nham, nspam FROM state')
453+ nham, nspam = cursor.fetchone()
454+ self.assertEqual(nham, 1)
455+ self.assertEqual(nspam, 20)
456
457=== modified file 'Quotient/xquotient/test/test_spambayes.py'
458--- Quotient/xquotient/test/test_spambayes.py 2009-07-16 00:52:55 +0000
459+++ Quotient/xquotient/test/test_spambayes.py 2012-08-20 19:55:33 +0000
460@@ -1,4 +1,8 @@
461
462+from StringIO import StringIO
463+
464+from spambayes.hammie import Hammie
465+
466 from twisted.trial import unittest
467
468 from axiom import store, userbase
469@@ -8,31 +12,121 @@
470 from xquotient.test.test_dspam import MessageCreationMixin
471
472
473+class SQLite3ClassifierTests(unittest.TestCase):
474+ """
475+ Tests for L{xquotient.spam._SQLite3Classifier}, a spambayes classifier which
476+ persists training data in a SQLite3 database.
477+ """
478+ def setUp(self):
479+ self.path = self.mktemp()
480+ self.bayes = spam._SQLite3Classifier(self.path)
481+ self.classifier = Hammie(self.bayes, mode='r')
482+
483+
484+ def test_nspam(self):
485+ """
486+ L{SQLite3Classifier} tracks, in memory, the number of spam messages it
487+ has been trained with.
488+ """
489+ self.classifier.train(StringIO("spam words of spamnfulness"), True)
490+ self.assertEqual(self.bayes.nspam, 1)
491+
492+
493+ def test_nspamPersisted(self):
494+ """
495+ L{SQLite3Classifier} tracks, in a database, the number of spam messages
496+ it has been trained with.
497+ """
498+ self.classifier.train(StringIO("spam words of spamfulness"), True)
499+ bayes = spam._SQLite3Classifier(self.path)
500+ self.assertEqual(bayes.nspam, 1)
501+
502+
503+ def test_nham(self):
504+ """
505+ L{SQLite3Classifier} tracks, in memory, the number of ham messages it
506+ has been trained with.
507+ """
508+ self.classifier.train(StringIO("very nice words"), False)
509+ self.assertEqual(self.bayes.nham, 1)
510+
511+
512+ def test_nhamPersisted(self):
513+ """
514+ L{SQLite3Classifier} tracks, in a database, the number of ham messages
515+ it has been trained with.
516+ """
517+ self.classifier.train(StringIO("very nice words"), False)
518+ bayes = spam._SQLite3Classifier(self.path)
519+ self.assertEqual(bayes.nham, 1)
520+
521+
522+ def test_spamClassification(self):
523+ """
524+ L{SQLite3Classifier} can be trained with a spam message so as to later
525+ classify messages like that one as spam.
526+ """
527+ self.classifier.train(StringIO("spam words of spamfulness"), True)
528+ self.assertTrue(
529+ self.classifier.score(StringIO("spamfulness words of spam")) > 0.99)
530+
531+
532+ def test_hamClassification(self):
533+ """
534+ L{SQLite3Classifier} can be trained with a ham message so as to later
535+ classify messages like that one as ham.
536+ """
537+ self.classifier.train(StringIO("very nice words"), False)
538+ self.assertTrue(
539+ self.classifier.score(StringIO("words, very nice")) < 0.01)
540+
541+
542+
543 class SpambayesFilterTestCase(unittest.TestCase, MessageCreationMixin):
544-
545+ """
546+ Tests for L{xquotient.spam.SpambayesFilter}.
547+ """
548 def setUp(self):
549 dbdir = self.mktemp()
550 self.store = s = store.Store(dbdir)
551- ls = userbase.LoginSystem(store=s)
552- installOn(ls, s)
553- acc = ls.addAccount('username', 'dom.ain', 'password')
554- ss = acc.avatars.open()
555- self.df = spam.SpambayesFilter(store=ss)
556- installOn(self.df, ss)
557- self.f = self.df.filter
558-
559- def testMessageClassification(self):
560- self.f.processItem(self._message())
561-
562- def testMessageTraining(self):
563+ def account():
564+ ls = userbase.LoginSystem(store=s)
565+ installOn(ls, s)
566+ acc = ls.addAccount('username', 'dom.ain', 'password')
567+ return acc.avatars.open()
568+ ss = s.transact(account)
569+ def spambayes():
570+ self.df = spam.SpambayesFilter(store=ss)
571+ installOn(self.df, ss)
572+ self.f = self.df.filter
573+ ss.transact(spambayes)
574+
575+
576+ def test_messageClassification(self):
577+ """
578+ When L{SpambayesFilter} is installed, L{Filter} uses it to classify
579+ items it processes.
580+ """
581+ self.store.transact(self.f.processItem, self._message())
582+ # XXX And the actual test here would be ... ? lp:#1039197
583+
584+
585+ def test_messageTraining(self):
586+ """
587+ L{SpambayesFilter.train} adds tokens from the given message to the
588+ training dataset for either ham or spam.
589+ """
590 m = self._message()
591 self.df.classify(m)
592 self.df.train(True, m)
593+ # XXX Really, unit tests should make _some_ assertions. lp:#1039197
594+
595
596 def test_messageRetraining(self):
597 """
598 Test that removing the training data and rebuilding it succeeds.
599 """
600- self.testMessageTraining()
601+ self.test_messageTraining()
602 self.df.forgetTraining()
603-
604+ # XXX The docstring is a total lie, there is no actual testing going on
605+ # here. lp:#1039197
