Merge lp:~inspirated/launchpad/implement-Bug-findAttachments into lp:launchpad

Proposed by Kamran Riaz Khan
Status: Rejected
Rejected by: Robert Collins
Proposed branch: lp:~inspirated/launchpad/implement-Bug-findAttachments
Merge into: lp:launchpad
Diff against target: 234 lines (+175/-0)
7 files modified
lib/lp/bugs/configure.zcml (+1/-0)
lib/lp/bugs/interfaces/bug.py (+10/-0)
lib/lp/bugs/model/bug.py (+61/-0)
lib/lp/bugs/tests/test_bug_find_attachment.py (+87/-0)
lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt (+4/-0)
lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt (+7/-0)
lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt (+5/-0)
To merge this branch: bzr merge lp:~inspirated/launchpad/implement-Bug-findAttachments
Reviewer Review Type Date Requested Status
Robert Collins (community) Abstain
Graham Binns (community) code Disapprove
Björn Tillenius Pending
Review via email: mp+27786@code.launchpad.net

This proposal supersedes a proposal from 2010-06-15.

Description of the change

Summary:
As part of my Google Summer of Code project, I had to implement attachment searching functionality in Arsenal. The end-product would allow user to specify a search string which would be searched in all the bug attachments for a project.

Doing this efficiently required two modifications in Launchpad:
    * Exposing a bug collection for a particular product
    * Implement a attachment search method for a particular bug (this branch)

Proposed fix:
Export a read operation findAttachments in IBug which returns a collection of IBugAttachment.

Pre-implementation Notes:
After a lengthy discussion on #launchpad-reviews with gmb, stub and BjornT, it was concluded that a naive string search was not suitable for inclusion into Launchpad. Improvements suggested were to:
    * Add an attachment size limit
    * Read the attachments in chunks
These improvements have been taken into account by adding a file size limit and using Boyer-Moore-Horspool algorithm for fast and memory efficient searching through the attachments. The algorithm performs better as needle strings grow longer [1].

Implementation Details:
    * Attachments larger than 8 mb cannot be searched
    * Boyer–Moore–Horspool is implemented via a nested function in Bug.findAttachments()
    * Zope configuration had to be updated to export the method
    * Multi-line searches are supported by an "is_encoded" parameter. If the client wants to include newline characters in <text>, it can encode them using text.encode('unicode_escape')
    * Attachments are searched according to default character encoding. The use of locale.getpreferredencoding() for this purpose bothers me a bit but I don't know how else to settle on a character encoding for the attachments. Instead of "assuming" that the attachments have the default encoding, I could use an encoding detection but that would introduce a whole lot of extra code in LP.

Tests:
Tests are included for plain text as well as multi-line and unicode searches:
bin/test -vvc -m lp.bugs.tests.test_bug_find_attachment

Demo and Q/A:
Open any Launchpad bug in a browser:
 https://launchpad.dev/bugs/15
Create an attachment and upload any text file containing the string ‘char buf’.
Create a Launchpad instance:
 >>> from launchpadlib.launchpad import Launchpad
 >>> launchpad = Launchpad.login_with('just testing', 'https://api.launchpad.dev/beta/')
 The authorization page:
    (https://launchpad.dev/+authorize-token?oauth_token=ppd2bRZDnN6VX94lRqrq)
 should be opening in your browser. After you have authorized
 this program to access Launchpad on your behalf you should come
 back here and press <Enter> to finish the authentication process.
Load the bug:
 >>> bug = launchpad.bugs[15]
Search for the attachment containing the string ‘char buf’:
 >>> results = bug.findAttachments(text=u'char buf')
 >>> for attachment in results:
 ... print attachment.title
 ...
 Buffer Overflow Intro.txt

lint:
The changes are lint clean.

Links:
[1] http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm

To post a comment you must log in.
Revision history for this message
Graham Binns (gmb) wrote : Posted in a previous version of this proposal

After a lengthy discussion in #launchpad-reviews we've agreed that there should be a different solution to this problem.

review: Disapprove (code)
11018. By Kamran Riaz Khan

Use encode('unicode_escape') instead of urllib.quote() for multiline
searches

11019. By Kamran Riaz Khan

Minor changes to improve code readability.

11020. By Kamran Riaz Khan

Removed urllib2 usage in Bug.findAttachments() by opening attachments
directly.

11021. By Kamran Riaz Khan

Minor changes to improve unittest readability.

Revision history for this message
Graham Binns (gmb) wrote :

Hi Kamran, sorry for not getting back to you sooner.

I'll take a look at the code now, but it seems from your cover letter that you've not taken into account Bjorn's (and mine, after some thinking) concerns that we simply shouldn't be searching files in a webservice request; it has far too many pitfalls.

I'd like to get Bjorn's input on this too, because I think that whilst your proposed solution obviates some of the problems it doesn't avoid all of them.

Revision history for this message
Graham Binns (gmb) wrote :
Download full text (12.5 KiB)

Hi Kamran,

Thanks for this branch. I like your implementation, though there are
some issues that need to be addressed WRT coding style before it would
be suitable for landing.

However, as I said above, I don't think that this is the right way to
solve the problem. There's too much potential for timeouts when we start
looping over file contents during HTTP requests. It might not be too bad
for small files, but what about a bug that has, for example, 10
attachments, each of which are 7.9MB in size? That's 79MB we need to
loop over, and I'm fairly certain that our timeout mechanisms would kick
in and kill the request before the search returned.

I'm going to mark this review as 'disapproved,' not because the code is
bad (it isn't) but because I don't think this is the right solution to
the problem. I'm sorry to say that I don't know what the right solution
to the problem actually is at this point, but I'd guess that something
involving FTIs would be a start, or some kind of asynchronous processing
of searches (though then you get into all kinds of knotty stuff with
callbacks).

I strongly suggest that you send an email to the launcpad-dev mailing
list outlining the problem and your proposed solution and inviting
people to suggest solutions. This is one of those areas where having
more eyes on the bug makes it (hopefully) easier to solve.

> === modified file 'lib/lp/bugs/configure.zcml'
> --- lib/lp/bugs/configure.zcml 2010-06-04 09:31:21 +0000
> +++ lib/lp/bugs/configure.zcml 2010-06-17 18:08:34 +0000
> @@ -644,6 +644,7 @@
> getBugWatch
> canBeNominatedFor
> getNominationFor
> + findAttachments
> getNominations
> date_last_message
> number_of_duplicates
>
> === modified file 'lib/lp/bugs/interfaces/bug.py'
> --- lib/lp/bugs/interfaces/bug.py 2010-06-10 18:55:22 +0000
> +++ lib/lp/bugs/interfaces/bug.py 2010-06-17 18:08:34 +0000
> @@ -361,6 +361,16 @@
> latest_patch = Attribute("The most recent patch of this bug.")
>
> @operation_parameters(
> + text=TextLine(title=_("Search text"), default=u""),
> + is_escaped=Bool(title=_("Search text is escaped"),
> + default=False, required=False))
> + @operation_returns_collection_of(IBugAttachment)
> + @export_read_operation()
> + def findAttachments(text=u"", is_escaped=False):

As I think said in the previous review, you should be passing None as a
default here rather than an empty string.

> + """Return all attachments related to this bug which match
> + <text>."""

Multi-line docstrings should be formatted per PEP 257. In this case:

    """Return all bug attachments whose contents contain `text`."""

> +
> + @operation_parameters(
> subject=optional_message_subject_field(),
> content=copy_field(IMessage['content']))
> @call_with(owner=REQUEST_USER)
>
> === modified file 'lib/lp/bugs/model/bug.py'
> --- lib/lp/bugs/model/bug.py 2010-06-10 18:55:22 +0000
> +++ lib/lp/bugs/model/bug.py 2010-06-17 18:08:34 +0000
> @@ -19,6 +19,7 @@
> ]
>
>
> +import locale
> impor...

review: Disapprove (code)
Revision history for this message
Robert Collins (lifeless) wrote :

Scanning even a single MB of content in a webserver request is going to have serious consequences on our capacity and performance (due to python and the GIL amongst other things).

We should broadly provide two things:
 - index the attachments where we can do so. We may need search overhauled - please feed your SoC stories into https://dev.launchpad.net/LEP/Search - thanks
 - events for new attachments so that someone wanting to can write an external custom indexer for such things.

This branch doesn't do either of these - sorry, but I really can't accept it into the code base.

Please speak with me or Graham (or anyone on the Bugs or Foundations teams) to move forward with your project - it sounds cool and I want to see it happen.

review: Abstain

Unmerged revisions

11021. By Kamran Riaz Khan

Minor changes to improve unittest readability.

11020. By Kamran Riaz Khan

Removed urllib2 usage in Bug.findAttachments() by opening attachments
directly.

11019. By Kamran Riaz Khan

Minor changes to improve code readability.

11018. By Kamran Riaz Khan

Use encode('unicode_escape') instead of urllib.quote() for multiline
searches

11017. By Kamran Riaz Khan

Changes in Bug.findAttachments():
 * Use Boyer-Moore-Horspool algorithm to search attachments in chunks
 * Fixed unicode and multiline unittests

11016. By Kamran Riaz Khan

Added support for multi-line searches in Bug.findAttachments() using
is_encoded parameter

11015. By Kamran Riaz Khan

Fixed Bug.findAttachments() to use preferred encoding instead of UTF-8

11014. By Kamran Riaz Khan

Changes in Bug.findAttachments():
 * Limited the size of attachments that can be searched to 8 MB
 * Modified searching to go-through attachments line-wise

11013. By Kamran Riaz Khan

Fixed unicode searches in Bug.findAttachments()

11012. By Kamran Riaz Khan

Added unit tests for Bug.findAttachments() method

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/bugs/configure.zcml'
2--- lib/lp/bugs/configure.zcml 2010-06-04 09:31:21 +0000
3+++ lib/lp/bugs/configure.zcml 2010-06-17 18:08:34 +0000
4@@ -644,6 +644,7 @@
5 getBugWatch
6 canBeNominatedFor
7 getNominationFor
8+ findAttachments
9 getNominations
10 date_last_message
11 number_of_duplicates
12
13=== modified file 'lib/lp/bugs/interfaces/bug.py'
14--- lib/lp/bugs/interfaces/bug.py 2010-06-10 18:55:22 +0000
15+++ lib/lp/bugs/interfaces/bug.py 2010-06-17 18:08:34 +0000
16@@ -361,6 +361,16 @@
17 latest_patch = Attribute("The most recent patch of this bug.")
18
19 @operation_parameters(
20+ text=TextLine(title=_("Search text"), default=u""),
21+ is_escaped=Bool(title=_("Search text is escaped"),
22+ default=False, required=False))
23+ @operation_returns_collection_of(IBugAttachment)
24+ @export_read_operation()
25+ def findAttachments(text=u"", is_escaped=False):
26+ """Return all attachments related to this bug which match
27+ <text>."""
28+
29+ @operation_parameters(
30 subject=optional_message_subject_field(),
31 content=copy_field(IMessage['content']))
32 @call_with(owner=REQUEST_USER)
33
34=== modified file 'lib/lp/bugs/model/bug.py'
35--- lib/lp/bugs/model/bug.py 2010-06-10 18:55:22 +0000
36+++ lib/lp/bugs/model/bug.py 2010-06-17 18:08:34 +0000
37@@ -19,6 +19,7 @@
38 ]
39
40
41+import locale
42 import operator
43 import re
44 from cStringIO import StringIO
45@@ -812,6 +813,66 @@
46 notification.date_emailed = UTC_NOW
47 notification.syncUpdate()
48
49+ def findAttachments(self, text=u"", is_escaped=False):
50+ """See `IBugAttachment`."""
51+ def boyermoore_horspool(fd, needle):
52+ nlen = len(needle)
53+ nlast = nlen - 1
54+
55+ skip = []
56+ for i in range(256):
57+ skip.append(nlen)
58+ for i in range(nlast):
59+ skip[ord(needle[i])] = nlast - i
60+
61+ pos = 0
62+ consumed = 0
63+ haystack = bytes()
64+ while True:
65+ more = nlen - (consumed - pos)
66+ morebytes = fd.read(more)
67+ haystack = haystack[more:] + morebytes
68+
69+ if len(morebytes) < more:
70+ return -1
71+ consumed = consumed + more
72+
73+ i = nlast
74+ while i >= 0 and haystack[i] == needle[i]:
75+ i = i - 1
76+ if i == -1:
77+ return pos
78+
79+ pos = pos + skip[ord(haystack[nlast])]
80+
81+ return -1
82+
83+ if not isinstance(text, unicode) and not isinstance(text, str):
84+ raise TypeError, "`text` must be a unicode string"
85+ if not text:
86+ return EmptyResultSet()
87+
88+ if is_escaped:
89+ text = text.decode('unicode_escape')
90+
91+ # Search the attachments using default encoding
92+ text = text.encode(locale.getpreferredencoding())
93+
94+ store = IStore(Bug)
95+ result = store.find(BugAttachment, BugAttachment.bug == self.id)
96+ # Do not search attachments larger than 8 mb
97+ maxsize = 8388608
98+ matches = set()
99+ for attachment in result:
100+ if attachment.data.content.filesize > maxsize:
101+ continue
102+ attachment.data.open()
103+ if boyermoore_horspool(attachment.data, text) != -1:
104+ matches.add(attachment)
105+ attachment.data.close()
106+
107+ return list(matches)
108+
109 def newMessage(self, owner=None, subject=None,
110 content=None, parent=None, bugwatch=None,
111 remote_comment_id=None):
112
113=== added file 'lib/lp/bugs/tests/test_bug_find_attachment.py'
114--- lib/lp/bugs/tests/test_bug_find_attachment.py 1970-01-01 00:00:00 +0000
115+++ lib/lp/bugs/tests/test_bug_find_attachment.py 2010-06-17 18:08:34 +0000
116@@ -0,0 +1,87 @@
117+# -*- coding: utf-8 -*-
118+# Copyright 2010 Canonical Ltd. This software is licensed under the
119+# GNU Affero General Public License version 3 (see the file LICENSE).
120+
121+__metaclass__ = type
122+
123+import os
124+import transaction
125+import urllib
126+import unittest
127+
128+from canonical.config import config
129+from canonical.testing import LaunchpadZopelessLayer
130+
131+from lp.bugs.interfaces.bugjob import BugJobType
132+from lp.bugs.model.bugjob import BugJob, BugJobDerived
133+from lp.testing import TestCaseWithFactory
134+
135+from storm.store import EmptyResultSet
136+
137+from zope.security.proxy import isinstance
138+
139+class BugFindAttachmentTestCase(TestCaseWithFactory):
140+ """Test case for bug attachment searches."""
141+
142+ layer = LaunchpadZopelessLayer
143+
144+ def setUp(self):
145+ super(BugFindAttachmentTestCase, self).setUp()
146+
147+ dir = os.path.join(config.root, 'lib/lp/bugs/tests/testfiles')
148+ filenames = [
149+ 'sample-attachment-lorem.txt',
150+ 'sample-attachment-repeatchars.txt',
151+ 'sample-attachment-unicode.txt',
152+ ]
153+ self.searches = [
154+ (u'Lorem', u'nulla pariatur', u'sint occaecat cupidatat'),
155+ (u'abccab', u'abcabc', u'abcxyzabcxyz', u'aabbccxxyyzz'),
156+ (u'νύχτας', u'Απλό γεύματος', u'ειδικά σαν πω.'),
157+ ]
158+ self.multilinesearches = [
159+ (u"est laborum.\n\nEam duis",),
160+ (u"abccab\n abcabc",),
161+ (u"σαν πω.\n\n\nΌλη",),
162+ ]
163+ filepaths = [
164+ os.path.join(dir, filename) for
165+ filename in filenames
166+ ]
167+ person = self.factory.makePerson()
168+ self.bug = self.factory.makeBug()
169+
170+ self.attachments = []
171+ for filename, filepath in zip(filenames, filepaths):
172+ attachment = self.factory.makeBugAttachment(
173+ bug=self.bug, owner=person, data=file(filepath),
174+ comment=filename, filename=filename)
175+ self.attachments.append(attachment)
176+
177+ transaction.commit()
178+
179+ def test_unicode_type_check(self):
180+ self.assertRaises(TypeError, self.bug.findAttachments, 42)
181+
182+ def test_empty_result_set(self):
183+ result = self.bug.findAttachments(u'')
184+ self.assertTrue(isinstance(result, EmptyResultSet))
185+
186+ def test_search_consistency(self):
187+ for (searches, attachment) in zip(self.searches, self.attachments):
188+ for search in searches:
189+ expected = [attachment]
190+ result = self.bug.findAttachments(search)
191+ self.assertEqual(expected, result)
192+
193+ def test_multiline_search_consistency(self):
194+ for (searches, attachment) in (
195+ zip(self.multilinesearches, self.attachments)):
196+ for search in searches:
197+ search = search.encode('unicode_escape')
198+ expected = [attachment]
199+ result = self.bug.findAttachments(search, is_escaped=True)
200+ self.assertEqual(expected, result)
201+
202+def test_suite():
203+ return unittest.TestLoader().loadTestsFromName(__name__)
204
205=== added file 'lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt'
206--- lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt 1970-01-01 00:00:00 +0000
207+++ lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt 2010-06-17 18:08:34 +0000
208@@ -0,0 +1,4 @@
209+Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
210+
211+Eam duis iisque maluisset cu, minim prodesset qui no. Sit justo volutpat torquatos ex. Fuisset instructior ius te, utinam munere luptatum te nec. Ad quis lorem delenit pro, ut pro maiorum molestiae. Impedit vituperata reprimique sed ne, qui indoctum aliquando adversarium ad. Eu mel congue praesent. His reque ornatus albucius at, ea simul civibus elaboraret vel.
212+
213
214=== added file 'lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt'
215--- lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt 1970-01-01 00:00:00 +0000
216+++ lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt 2010-06-17 18:08:34 +0000
217@@ -0,0 +1,7 @@
218+This file contains special search strings in which patterns repeat:
219+
220+ abccab
221+ abcabc
222+ abcxyzabcxyz
223+ aabbccxxyyzz
224+
225
226=== added file 'lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt'
227--- lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt 1970-01-01 00:00:00 +0000
228+++ lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt 2010-06-17 18:08:34 +0000
229@@ -0,0 +1,5 @@
230+Τα όρο νύχτας διευθυντές εκτελούσαν, ας έχω άμεση έκδοση αποστηθίσει. Άτομο αφήσεις επιστρέφουν του δε. Με δείξει έστειλε καθυστερούσε όρο, για' σε πρόσληψη εξακολουθεί επιδιορθώσεις. Απλό γεύματος εργαζόμενοι κι νέα, εγώ υπηρεσία καθορίζουν προγραμματιστής μα. Επί ξέχασε κειμένων οι, υόρκη πιθανό κοιτάζοντας ώρα πω. Στην ειδικά σαν πω.
231+
232+
233+Όλη πόρτες εντυπωσιακό μη, νέο δημιουργια λιγότερους αν. Φράση κάποιο τρόποι κι για, όχι τρόποι δίνοντας δε, αν αρέσει βασανίζουν περισσότερο για'. Μα πιο τελικά δουλεύει, δε νέο δοκιμάσεις ευκολότερο διασφαλίζεται. Οι την πάρεις εργαζόμενοι μεταγλωτιστής, δε δύο πλέον κώδικα καθυστερεί, πες το στήλες διορθώσει. Όσο μη κόλπα παραγωγικής, με κλπ λέει τότε συνηθίζουν, κάποιο είχαμε για το.
234+