Merge lp:~inspirated/launchpad/implement-Bug-findAttachments into lp:launchpad

Proposed by Kamran Riaz Khan
Status: Rejected
Rejected by: Robert Collins
Proposed branch: lp:~inspirated/launchpad/implement-Bug-findAttachments
Merge into: lp:launchpad
Diff against target: 234 lines (+175/-0)
7 files modified
lib/lp/bugs/configure.zcml (+1/-0)
lib/lp/bugs/interfaces/bug.py (+10/-0)
lib/lp/bugs/model/bug.py (+61/-0)
lib/lp/bugs/tests/test_bug_find_attachment.py (+87/-0)
lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt (+4/-0)
lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt (+7/-0)
lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt (+5/-0)
To merge this branch: bzr merge lp:~inspirated/launchpad/implement-Bug-findAttachments
Reviewer Review Type Date Requested Status
Robert Collins (community) Abstain
Graham Binns (community) code Disapprove
Björn Tillenius Pending
Review via email: mp+27786@code.launchpad.net

This proposal supersedes a proposal from 2010-06-15.

Description of the change

Summary:
As part of my Google Summer of Code project, I had to implement attachment searching functionality in Arsenal. The end-product would allow user to specify a search string which would be searched in all the bug attachments for a project.

Doing this efficiently required two modifications in Launchpad:
    * Exposing a bug collection for a particular product
    * Implement a attachment search method for a particular bug (this branch)

Proposed fix:
Export a read operation findAttachments in IBug which returns a collection of IBugAttachment.

Pre-implementation Notes:
After a lengthy discussion on #launchpad-reviews with gmb, stub and BjornT, it was concluded that a naive string search was not suitable for inclusion into Launchpad. Improvements suggested were to:
    * Add an attachment size limit
    * Read the attachments in chunks
These improvements have been taken into account by adding a file size limit and using Boyer-Moore-Horspool algorithm for fast and memory efficient searching through the attachments. The algorithm performs better as needle strings grow longer [1].

Implementation Details:
    * Attachments larger than 8 mb cannot be searched
    * Boyer–Moore–Horspool is implemented via a nested function in Bug.findAttachments()
    * Zope configuration had to be updated to export the method
    * Multi-line searches are supported by an "is_encoded" parameter. If the client wants to include newline characters in <text>, it can encode them using text.encode('unicode_escape')
    * Attachments are searched according to default character encoding. The use of locale.getpreferredencoding() for this purpose bothers me a bit but I don't know how else to settle on a character encoding for the attachments. Instead of "assuming" that the attachments have the default encoding, I could use an encoding detection but that would introduce a whole lot of extra code in LP.

Tests:
Tests are included for plain text as well as multi-line and unicode searches:
bin/test -vvc -m lp.bugs.tests.test_bug_find_attachment

Demo and Q/A:
Open any Launchpad bug in a browser:
 https://launchpad.dev/bugs/15
Create an attachment and upload any text file containing the string ‘char buf’.
Create a Launchpad instance:
 >>> from launchpadlib.launchpad import Launchpad
 >>> launchpad = Launchpad.login_with('just testing', 'https://api.launchpad.dev/beta/')
 The authorization page:
    (https://launchpad.dev/+authorize-token?oauth_token=ppd2bRZDnN6VX94lRqrq)
 should be opening in your browser. After you have authorized
 this program to access Launchpad on your behalf you should come
 back here and press <Enter> to finish the authentication process.
Load the bug:
 >>> bug = launchpad.bugs[15]
Search for the attachment containing the string ‘char buf’:
 >>> results = bug.findAttachments(text=u'char buf')
 >>> for attachment in results:
 ... print attachment.title
 ...
 Buffer Overflow Intro.txt

lint:
The changes are lint clean.

Links:
[1] http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm

To post a comment you must log in.
Revision history for this message
Graham Binns (gmb) wrote : Posted in a previous version of this proposal

After a lengthy discussion in #launchpad-reviews we've agreed that there should be a different solution to this problem.

review: Disapprove (code)
11018. By Kamran Riaz Khan

Use encode('unicode_escape') instead of urllib.quote() for multiline
searches

11019. By Kamran Riaz Khan

Minor changes to improve code readability.

11020. By Kamran Riaz Khan

Removed urllib2 usage in Bug.findAttachments() by opening attachments
directly.

11021. By Kamran Riaz Khan

Minor changes to improve unittest readability.

Revision history for this message
Graham Binns (gmb) wrote :

Hi Kamran, sorry for not getting back to you sooner.

I'll take a look at the code now, but it seems from your cover letter that you've not taken into account Bjorn's (and mine, after some thinking) concerns that we simply shouldn't be searching files in a webservice request; it has far too many pitfalls.

I'd like to get Bjorn's input on this too, because I think that whilst your proposed solution obviates some of the problems it doesn't avoid all of them.

Revision history for this message
Graham Binns (gmb) wrote :
Download full text (12.5 KiB)

Hi Kamran,

Thanks for this branch. I like your implementation, though there are
some issues that need to be addressed WRT coding style before it would
be suitable for landing.

However, as I said above, I don't think that this is the right way to
solve the problem. There's too much potential for timeouts when we start
looping over file contents during HTTP requests. It might not be too bad
for small files, but what about a bug that has, for example, 10
attachments, each of which are 7.9MB in size? That's 79MB we need to
loop over, and I'm fairly certain that our timeout mechanisms would kick
in and kill the request before the search returned.

I'm going to mark this review as 'disapproved,' not because the code is
bad (it isn't) but because I don't think this is the right solution to
the problem. I'm sorry to say that I don't know what the right solution
to the problem actually is at this point, but I'd guess that something
involving FTIs would be a start, or some kind of asynchronous processing
of searches (though then you get into all kinds of knotty stuff with
callbacks).

I strongly suggest that you send an email to the launcpad-dev mailing
list outlining the problem and your proposed solution and inviting
people to suggest solutions. This is one of those areas where having
more eyes on the bug makes it (hopefully) easier to solve.

> === modified file 'lib/lp/bugs/configure.zcml'
> --- lib/lp/bugs/configure.zcml 2010-06-04 09:31:21 +0000
> +++ lib/lp/bugs/configure.zcml 2010-06-17 18:08:34 +0000
> @@ -644,6 +644,7 @@
> getBugWatch
> canBeNominatedFor
> getNominationFor
> + findAttachments
> getNominations
> date_last_message
> number_of_duplicates
>
> === modified file 'lib/lp/bugs/interfaces/bug.py'
> --- lib/lp/bugs/interfaces/bug.py 2010-06-10 18:55:22 +0000
> +++ lib/lp/bugs/interfaces/bug.py 2010-06-17 18:08:34 +0000
> @@ -361,6 +361,16 @@
> latest_patch = Attribute("The most recent patch of this bug.")
>
> @operation_parameters(
> + text=TextLine(title=_("Search text"), default=u""),
> + is_escaped=Bool(title=_("Search text is escaped"),
> + default=False, required=False))
> + @operation_returns_collection_of(IBugAttachment)
> + @export_read_operation()
> + def findAttachments(text=u"", is_escaped=False):

As I think said in the previous review, you should be passing None as a
default here rather than an empty string.

> + """Return all attachments related to this bug which match
> + <text>."""

Multi-line docstrings should be formatted per PEP 257. In this case:

    """Return all bug attachments whose contents contain `text`."""

> +
> + @operation_parameters(
> subject=optional_message_subject_field(),
> content=copy_field(IMessage['content']))
> @call_with(owner=REQUEST_USER)
>
> === modified file 'lib/lp/bugs/model/bug.py'
> --- lib/lp/bugs/model/bug.py 2010-06-10 18:55:22 +0000
> +++ lib/lp/bugs/model/bug.py 2010-06-17 18:08:34 +0000
> @@ -19,6 +19,7 @@
> ]
>
>
> +import locale
> impor...

review: Disapprove (code)
Revision history for this message
Robert Collins (lifeless) wrote :

Scanning even a single MB of content in a webserver request is going to have serious consequences on our capacity and performance (due to python and the GIL amongst other things).

We should broadly provide two things:
 - index the attachments where we can do so. We may need search overhauled - please feed your SoC stories into https://dev.launchpad.net/LEP/Search - thanks
 - events for new attachments so that someone wanting to can write an external custom indexer for such things.

This branch doesn't do either of these - sorry, but I really can't accept it into the code base.

Please speak with me or Graham (or anyone on the Bugs or Foundations teams) to move forward with your project - it sounds cool and I want to see it happen.

review: Abstain

Unmerged revisions

11021. By Kamran Riaz Khan

Minor changes to improve unittest readability.

11020. By Kamran Riaz Khan

Removed urllib2 usage in Bug.findAttachments() by opening attachments
directly.

11019. By Kamran Riaz Khan

Minor changes to improve code readability.

11018. By Kamran Riaz Khan

Use encode('unicode_escape') instead of urllib.quote() for multiline
searches

11017. By Kamran Riaz Khan

Changes in Bug.findAttachments():
 * Use Boyer-Moore-Horspool algorithm to search attachments in chunks
 * Fixed unicode and multiline unittests

11016. By Kamran Riaz Khan

Added support for multi-line searches in Bug.findAttachments() using
is_encoded parameter

11015. By Kamran Riaz Khan

Fixed Bug.findAttachments() to use preferred encoding instead of UTF-8

11014. By Kamran Riaz Khan

Changes in Bug.findAttachments():
 * Limited the size of attachments that can be searched to 8 MB
 * Modified searching to go-through attachments line-wise

11013. By Kamran Riaz Khan

Fixed unicode searches in Bug.findAttachments()

11012. By Kamran Riaz Khan

Added unit tests for Bug.findAttachments() method

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lib/lp/bugs/configure.zcml'
--- lib/lp/bugs/configure.zcml 2010-06-04 09:31:21 +0000
+++ lib/lp/bugs/configure.zcml 2010-06-17 18:08:34 +0000
@@ -644,6 +644,7 @@
644 getBugWatch644 getBugWatch
645 canBeNominatedFor645 canBeNominatedFor
646 getNominationFor646 getNominationFor
647 findAttachments
647 getNominations648 getNominations
648 date_last_message649 date_last_message
649 number_of_duplicates650 number_of_duplicates
650651
=== modified file 'lib/lp/bugs/interfaces/bug.py'
--- lib/lp/bugs/interfaces/bug.py 2010-06-10 18:55:22 +0000
+++ lib/lp/bugs/interfaces/bug.py 2010-06-17 18:08:34 +0000
@@ -361,6 +361,16 @@
361 latest_patch = Attribute("The most recent patch of this bug.")361 latest_patch = Attribute("The most recent patch of this bug.")
362362
363 @operation_parameters(363 @operation_parameters(
364 text=TextLine(title=_("Search text"), default=u""),
365 is_escaped=Bool(title=_("Search text is escaped"),
366 default=False, required=False))
367 @operation_returns_collection_of(IBugAttachment)
368 @export_read_operation()
369 def findAttachments(text=u"", is_escaped=False):
370 """Return all attachments related to this bug which match
371 <text>."""
372
373 @operation_parameters(
364 subject=optional_message_subject_field(),374 subject=optional_message_subject_field(),
365 content=copy_field(IMessage['content']))375 content=copy_field(IMessage['content']))
366 @call_with(owner=REQUEST_USER)376 @call_with(owner=REQUEST_USER)
367377
=== modified file 'lib/lp/bugs/model/bug.py'
--- lib/lp/bugs/model/bug.py 2010-06-10 18:55:22 +0000
+++ lib/lp/bugs/model/bug.py 2010-06-17 18:08:34 +0000
@@ -19,6 +19,7 @@
19 ]19 ]
2020
2121
22import locale
22import operator23import operator
23import re24import re
24from cStringIO import StringIO25from cStringIO import StringIO
@@ -812,6 +813,66 @@
812 notification.date_emailed = UTC_NOW813 notification.date_emailed = UTC_NOW
813 notification.syncUpdate()814 notification.syncUpdate()
814815
816 def findAttachments(self, text=u"", is_escaped=False):
817 """See `IBugAttachment`."""
818 def boyermoore_horspool(fd, needle):
819 nlen = len(needle)
820 nlast = nlen - 1
821
822 skip = []
823 for i in range(256):
824 skip.append(nlen)
825 for i in range(nlast):
826 skip[ord(needle[i])] = nlast - i
827
828 pos = 0
829 consumed = 0
830 haystack = bytes()
831 while True:
832 more = nlen - (consumed - pos)
833 morebytes = fd.read(more)
834 haystack = haystack[more:] + morebytes
835
836 if len(morebytes) < more:
837 return -1
838 consumed = consumed + more
839
840 i = nlast
841 while i >= 0 and haystack[i] == needle[i]:
842 i = i - 1
843 if i == -1:
844 return pos
845
846 pos = pos + skip[ord(haystack[nlast])]
847
848 return -1
849
850 if not isinstance(text, unicode) and not isinstance(text, str):
851 raise TypeError, "`text` must be a unicode string"
852 if not text:
853 return EmptyResultSet()
854
855 if is_escaped:
856 text = text.decode('unicode_escape')
857
858 # Search the attachments using default encoding
859 text = text.encode(locale.getpreferredencoding())
860
861 store = IStore(Bug)
862 result = store.find(BugAttachment, BugAttachment.bug == self.id)
863 # Do not search attachments larger than 8 mb
864 maxsize = 8388608
865 matches = set()
866 for attachment in result:
867 if attachment.data.content.filesize > maxsize:
868 continue
869 attachment.data.open()
870 if boyermoore_horspool(attachment.data, text) != -1:
871 matches.add(attachment)
872 attachment.data.close()
873
874 return list(matches)
875
815 def newMessage(self, owner=None, subject=None,876 def newMessage(self, owner=None, subject=None,
816 content=None, parent=None, bugwatch=None,877 content=None, parent=None, bugwatch=None,
817 remote_comment_id=None):878 remote_comment_id=None):
818879
=== added file 'lib/lp/bugs/tests/test_bug_find_attachment.py'
--- lib/lp/bugs/tests/test_bug_find_attachment.py 1970-01-01 00:00:00 +0000
+++ lib/lp/bugs/tests/test_bug_find_attachment.py 2010-06-17 18:08:34 +0000
@@ -0,0 +1,87 @@
1# -*- coding: utf-8 -*-
2# Copyright 2010 Canonical Ltd. This software is licensed under the
3# GNU Affero General Public License version 3 (see the file LICENSE).
4
5__metaclass__ = type
6
7import os
8import transaction
9import urllib
10import unittest
11
12from canonical.config import config
13from canonical.testing import LaunchpadZopelessLayer
14
15from lp.bugs.interfaces.bugjob import BugJobType
16from lp.bugs.model.bugjob import BugJob, BugJobDerived
17from lp.testing import TestCaseWithFactory
18
19from storm.store import EmptyResultSet
20
21from zope.security.proxy import isinstance
22
23class BugFindAttachmentTestCase(TestCaseWithFactory):
24 """Test case for bug attachment searches."""
25
26 layer = LaunchpadZopelessLayer
27
28 def setUp(self):
29 super(BugFindAttachmentTestCase, self).setUp()
30
31 dir = os.path.join(config.root, 'lib/lp/bugs/tests/testfiles')
32 filenames = [
33 'sample-attachment-lorem.txt',
34 'sample-attachment-repeatchars.txt',
35 'sample-attachment-unicode.txt',
36 ]
37 self.searches = [
38 (u'Lorem', u'nulla pariatur', u'sint occaecat cupidatat'),
39 (u'abccab', u'abcabc', u'abcxyzabcxyz', u'aabbccxxyyzz'),
40 (u'νύχτας', u'Απλό γεύματος', u'ειδικά σαν πω.'),
41 ]
42 self.multilinesearches = [
43 (u"est laborum.\n\nEam duis",),
44 (u"abccab\n abcabc",),
45 (u"σαν πω.\n\n\nΌλη",),
46 ]
47 filepaths = [
48 os.path.join(dir, filename) for
49 filename in filenames
50 ]
51 person = self.factory.makePerson()
52 self.bug = self.factory.makeBug()
53
54 self.attachments = []
55 for filename, filepath in zip(filenames, filepaths):
56 attachment = self.factory.makeBugAttachment(
57 bug=self.bug, owner=person, data=file(filepath),
58 comment=filename, filename=filename)
59 self.attachments.append(attachment)
60
61 transaction.commit()
62
63 def test_unicode_type_check(self):
64 self.assertRaises(TypeError, self.bug.findAttachments, 42)
65
66 def test_empty_result_set(self):
67 result = self.bug.findAttachments(u'')
68 self.assertTrue(isinstance(result, EmptyResultSet))
69
70 def test_search_consistency(self):
71 for (searches, attachment) in zip(self.searches, self.attachments):
72 for search in searches:
73 expected = [attachment]
74 result = self.bug.findAttachments(search)
75 self.assertEqual(expected, result)
76
77 def test_multiline_search_consistency(self):
78 for (searches, attachment) in (
79 zip(self.multilinesearches, self.attachments)):
80 for search in searches:
81 search = search.encode('unicode_escape')
82 expected = [attachment]
83 result = self.bug.findAttachments(search, is_escaped=True)
84 self.assertEqual(expected, result)
85
86def test_suite():
87 return unittest.TestLoader().loadTestsFromName(__name__)
088
=== added file 'lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt'
--- lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt 1970-01-01 00:00:00 +0000
+++ lib/lp/bugs/tests/testfiles/sample-attachment-lorem.txt 2010-06-17 18:08:34 +0000
@@ -0,0 +1,4 @@
1Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
2
3Eam duis iisque maluisset cu, minim prodesset qui no. Sit justo volutpat torquatos ex. Fuisset instructior ius te, utinam munere luptatum te nec. Ad quis lorem delenit pro, ut pro maiorum molestiae. Impedit vituperata reprimique sed ne, qui indoctum aliquando adversarium ad. Eu mel congue praesent. His reque ornatus albucius at, ea simul civibus elaboraret vel.
4
05
=== added file 'lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt'
--- lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt 1970-01-01 00:00:00 +0000
+++ lib/lp/bugs/tests/testfiles/sample-attachment-repeatchars.txt 2010-06-17 18:08:34 +0000
@@ -0,0 +1,7 @@
1This file contains special search strings in which patterns repeat:
2
3 abccab
4 abcabc
5 abcxyzabcxyz
6 aabbccxxyyzz
7
08
=== added file 'lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt'
--- lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt 1970-01-01 00:00:00 +0000
+++ lib/lp/bugs/tests/testfiles/sample-attachment-unicode.txt 2010-06-17 18:08:34 +0000
@@ -0,0 +1,5 @@
1Τα όρο νύχτας διευθυντές εκτελούσαν, ας έχω άμεση έκδοση αποστηθίσει. Άτομο αφήσεις επιστρέφουν του δε. Με δείξει έστειλε καθυστερούσε όρο, για' σε πρόσληψη εξακολουθεί επιδιορθώσεις. Απλό γεύματος εργαζόμενοι κι νέα, εγώ υπηρεσία καθορίζουν προγραμματιστής μα. Επί ξέχασε κειμένων οι, υόρκη πιθανό κοιτάζοντας ώρα πω. Στην ειδικά σαν πω.
2
3
4Όλη πόρτες εντυπωσιακό μη, νέο δημιουργια λιγότερους αν. Φράση κάποιο τρόποι κι για, όχι τρόποι δίνοντας δε, αν αρέσει βασανίζουν περισσότερο για'. Μα πιο τελικά δουλεύει, δε νέο δοκιμάσεις ευκολότερο διασφαλίζεται. Οι την πάρεις εργαζόμενοι μεταγλωτιστής, δε δύο πλέον κώδικα καθυστερεί, πες το στήλες διορθώσει. Όσο μη κόλπα παραγωγικής, με κλπ λέει τότε συνηθίζουν, κάποιο είχαμε για το.
5