Merge lp:~gz/bzr-grep/no_match_fast_path into lp:bzr-grep

Proposed by Martin Packman
Status: Merged
Approved by: Parth Malwankar
Approved revision: 157
Merged at revision: 139
Proposed branch: lp:~gz/bzr-grep/no_match_fast_path
Merge into: lp:bzr-grep
Prerequisite: lp:~gz/bzr-grep/preformat_output
Diff against target: 156 lines (+67/-15)
4 files modified
NEWS (+6/-0)
__init__.py (+2/-2)
grep.py (+53/-13)
test_grep.py (+6/-0)
To merge this branch: bzr merge lp:~gz/bzr-grep/no_match_fast_path
Reviewer Review Type Date Requested Status
Parth Malwankar Approve
Review via email: mp+26874@code.launchpad.net

Description of the change

Optimisation, avoids splitting the text into a list of lines and testing each one for a match. There are actually a couple of cases where we have avoid this anyway, as explained in the comments, and currently always fall back to splitting into lines after a match is found.

Timings of most interest here will be a big tree with only a few files hit by the pattern.

To post a comment you must log in.
Revision history for this message
Parth Malwankar (parthm) wrote :

This is a really neat optimization. Thanks.
The hot cache numbers for emacs tree working tree search go from 2.39-2.42s to 0.93-0.96s for a pattern thats not found. One we merge in the prerequisite branch we can merge this in. Please add a NEWS entry.

review: Approve
lp:~gz/bzr-grep/no_match_fast_path updated
158. By Martin Packman

Fix previously untested bug with regexp and line numbers introduced by optimisation

159. By Martin Packman

Scale back no-match fast path to avoid some behaviour changes with line endings

Revision history for this message
Martin Packman (gz) wrote :

Even with the current checks, this branch changes some corner-cases with line-endings. The existing behaviour isn't all that great, but should really decide on what we want the plugin to do and make it consistent.

I've removed a little of the code for the moment, so the first match will happen twice, but the basic speedup should be the same, and the landing risk should be lower.

lp:~gz/bzr-grep/no_match_fast_path updated
160. By Martin Packman

Add NEWS entry for no match fast path change

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2010-05-23 07:59:31 +0000
3+++ NEWS 2010-06-07 19:36:31 +0000
4@@ -1,3 +1,9 @@
5+bzr-grep 0.4.0-dev
6+==================================
7+* Added fast path for no match that avoids splitting the file text into
8+ seperate lines and testing each one, by checking the entire text for a
9+ possible match initially. (Martin [gz])
10+
11 bzr-grep 0.3.0-final - 23-May-2010
12 ==================================
13 * Support for --color option (POSIX only). (Parth Malwankar, #571694)
14
15=== modified file '__init__.py'
16--- __init__.py 2010-06-07 19:36:31 +0000
17+++ __init__.py 2010-06-07 19:36:31 +0000
18@@ -198,9 +198,9 @@
19 pattern = re.escape(pattern)
20
21 patternc = None
22- re_flags = 0
23+ re_flags = re.MULTILINE
24 if ignore_case:
25- re_flags = re.IGNORECASE
26+ re_flags |= re.IGNORECASE
27
28 if not fixed_string:
29 patternc = grep.compile_pattern(pattern, re_flags)
30
31=== modified file 'grep.py'
32--- grep.py 2010-06-07 19:36:31 +0000
33+++ grep.py 2010-06-07 19:36:31 +0000
34@@ -14,6 +14,9 @@
35 # along with this program; if not, write to the Free Software
36 # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
37
38+import os
39+import sys
40+
41 from bzrlib.lazy_import import lazy_import
42 lazy_import(globals(), """
43 from fnmatch import fnmatch
44@@ -469,38 +472,75 @@
45 writeline = opts.outputter.get_writer(path, revno, cache_id)
46
47 if opts.files_with_matches or opts.files_without_match:
48- # While printing files with matches we only have two case
49- # print file name or print file name with revno.
50- found = False
51 if opts.fixed_string:
52- for line in file_text.splitlines():
53- if pattern in line:
54- found = True
55- break
56+ if sys.platform > (2, 5):
57+ found = pattern in file_text
58+ else:
59+ for line in file_text.splitlines():
60+ if pattern in line:
61+ found = True
62+ break
63+ else:
64+ found = False
65 else:
66 search = opts.patternc.search
67- for line in file_text.splitlines():
68- if search(line):
69- found = True
70- break
71+ if "$" not in pattern:
72+ found = search(file_text) is not None
73+ else:
74+ for line in file_text.splitlines():
75+ if search(line):
76+ found = True
77+ break
78+ else:
79+ found = False
80 if (opts.files_with_matches and found) or \
81 (opts.files_without_match and not found):
82 writeline()
83 elif opts.fixed_string:
84+ # Fast path for no match, search through the entire file at once rather
85+ # than a line at a time. However, we don't want this without Python 2.5
86+ # as the quick string search algorithm wasn't implemented till then:
87+ # <http://effbot.org/zone/stringlib.htm>
88+ if sys.version_info > (2, 5):
89+ i = file_text.find(pattern)
90+ if i == -1:
91+ return
92+ b = file_text.rfind("\n", 0, i) + 1
93+ if opts.line_number:
94+ start = file_text.count("\n", 0, b) + 1
95+ file_text = file_text[b:]
96+ else:
97+ start = 1
98 if opts.line_number:
99 for index, line in enumerate(file_text.splitlines()):
100 if pattern in line:
101- writeline(lineno=index+1, line=line)
102+ writeline(lineno=index+start, line=line)
103 else:
104 for line in file_text.splitlines():
105 if pattern in line:
106 writeline(line=line)
107 else:
108+ # Fast path on no match, the re module avoids bad behaviour in most
109+ # standard cases, but perhaps could try and detect backtracking
110+ # patterns here and avoid whole text search in those cases
111 search = opts.patternc.search
112+ if "$" not in pattern:
113+ # GZ 2010-06-05: Grr, re.MULTILINE can't save us when searching
114+ # through revisions as bazaar returns binary mode
115+ # and trailing \r breaks $ as line ending match
116+ m = search(file_text)
117+ if m is None:
118+ return
119+ b = file_text.rfind("\n", 0, m.start()) + 1
120+ if opts.line_number:
121+ start = file_text.count("\n", 0, b) + 1
122+ file_text = file_text[b:]
123+ else:
124+ start = 1
125 if opts.line_number:
126 for index, line in enumerate(file_text.splitlines()):
127 if search(line):
128- writeline(lineno=index+1, line=line)
129+ writeline(lineno=index+start, line=line)
130 else:
131 for line in file_text.splitlines():
132 if search(line):
133
134=== modified file 'test_grep.py'
135--- test_grep.py 2010-06-07 19:36:31 +0000
136+++ test_grep.py 2010-06-07 19:36:31 +0000
137@@ -1082,6 +1082,9 @@
138 '-n', 'line1', 'file0.txt'])
139 self.assertContainsRe(out, "file0.txt~.:1:line1", flags=TestGrep._reflags)
140
141+ out, err = self.run_bzr(['grep', '-n', 'line[0-9]', 'file0.txt'])
142+ self.assertContainsRe(out, "file0.txt:3:line3", flags=TestGrep._reflags)
143+
144 def test_wtree_with_line_number(self):
145 """(wtree) Search for pattern with --line-number.
146 """
147@@ -1099,6 +1102,9 @@
148 out, err = self.run_bzr(['grep', '-n', '[hjkl]ine1', 'file0.txt'])
149 self.assertContainsRe(out, "file0.txt:1:line1", flags=TestGrep._reflags)
150
151+ out, err = self.run_bzr(['grep', '-n', 'line[0-9]', 'file0.txt'])
152+ self.assertContainsRe(out, "file0.txt:3:line3", flags=TestGrep._reflags)
153+
154 def test_revno_basic_history_grep_file(self):
155 """Search for pattern in specific revision number in a file.
156 """

Subscribers

People subscribed via source and target branches