Merge lp:~jspashett/bzr/183504_latin_2_ignore_file into lp:bzr

Proposed by Jason Spashett
Status: Merged
Approved by: Martin Pool
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~jspashett/bzr/183504_latin_2_ignore_file
Merge into: lp:bzr
Diff against target: 72 lines (+40/-2)
2 files modified
bzrlib/ignores.py (+28/-2)
bzrlib/tests/test_ignores.py (+12/-0)
To merge this branch: bzr merge lp:~jspashett/bzr/183504_latin_2_ignore_file
Reviewer Review Type Date Requested Status
Martin Pool Approve
John A Meinel Approve
Review via email: mp+22345@code.launchpad.net

This proposal supersedes a proposal from 2009-10-02.

To post a comment you must log in.
Revision history for this message
Jason Spashett (jspashett) wrote : Posted in a previous version of this proposal

Fix for 183504. Invalid utf in .bzrignore causes stack trace.

How:
parse_ignore_file changed in ignores.py so that each line is utf8 decoded individualy after being split on '\n'. Line count is maintained. UnicodeDecodeError is caught and a trace message is output to signal the error with the offending line number. Processing of further lines continues.
The warning output is:

".bzrignore: On Line %d, malformed utf8 character. Ignoring line."

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

1) this is really something that would be good to have a test for.
   We have tests in "bzrlib/tests/test_ignores.py"

2) Splitting into line-by-line is probably a good way to also handle stuff like bad regexes, etc.
   Not something you need to do here, just thinking out loud.

3) I'm slightly concerned about the overhead of calling .decode('utf-8') 100 times, rather than
   just 1 time for the common case that the file is all correctly written. (Certainly having a bad
   line should be an exceptional case.) Perhaps something like:

   try:
      unicode_lines = content.decode('utf8').splitlines()
   except UnicodeDecodeError:
      lines = content.splitlines()
      unicode_lines = []
      for idx, line in enumerate(lines):
          try:
              unicode_lines.append(line.decode('utf-8'))
          except UnicodeDecodeError:
              # report error about line (idx+1)
    ...

So adding tests is certainly a requirement. Refactoring to make the common case fast would be nice.

It also makes me wonder if we'd want to change something to return the bogus lines (it would make writing tests easier, which usually points to better apis in the real-world...) Think about it. But just adding a test that we get the ignores we expect when there are non-utf8 characters would be ok.

review: Needs Fixing
Revision history for this message
Jason Spashett (jspashett) wrote :

As suggested by John Meinel's review:

* Add unit test

* Parse ignore file, if there is a unicode decode error then parse each line and collect all valid lines, giving warnings on lines with invalid utf8.

revs 5123-5124

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jason Spashett wrote:
> Jason Spashett has proposed merging lp:~jspashett/bzr/183504_latin_2_ignore_file into lp:bzr.
>
> Requested reviews:
> John A Meinel (jameinel)
> Related bugs:
> #183504 'bzr status' crash if .bzrignore containts Latin-2 chars
> https://bugs.launchpad.net/bugs/183504
>
>

 review: approve

Looks good to me.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuw+ZoACgkQJdeBCYSNAAN6eACdH2JZuymMx8fanSFJ18nxf7pa
KIcAnAkHg2u3NueRUgIV2OGa4sUrxDIR
=9UIq
-----END PGP SIGNATURE-----

review: Approve
Revision history for this message
Martin Pool (mbp) wrote :

I'll add news and merge it.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/ignores.py'
2--- bzrlib/ignores.py 2009-03-23 14:59:43 +0000
3+++ bzrlib/ignores.py 2010-03-29 01:43:17 +0000
4@@ -25,6 +25,8 @@
5 globbing,
6 )
7
8+from trace import warning
9+
10 # This was the full ignore list for bzr 0.8
11 # please keep these sorted (in C locale order) to aid merging
12 OLD_DEFAULTS = [
13@@ -100,10 +102,34 @@
14 ]
15
16
17+
18 def parse_ignore_file(f):
19- """Read in all of the lines in the file and turn it into an ignore list"""
20+ """Read in all of the lines in the file and turn it into an ignore list
21+
22+ Continue in the case of utf8 decoding errors, and emit a warning when
23+ such and error is found. Optimise for the common case -- no decoding
24+ errors.
25+ """
26 ignored = set()
27- for line in f.read().decode('utf8').split('\n'):
28+ ignore_file = f.read()
29+ try:
30+ # Try and parse whole ignore file at once.
31+ unicode_lines = ignore_file.decode('utf8').split('\n')
32+ except UnicodeDecodeError:
33+ # Otherwise go though line by line and pick out the 'good'
34+ # decodable lines
35+ lines = ignore_file.split('\n')
36+ unicode_lines = []
37+ for line_number, line in enumerate(lines):
38+ try:
39+ unicode_lines.append(line.decode('utf-8'))
40+ except UnicodeDecodeError:
41+ # report error about line (idx+1)
42+ warning('.bzrignore: On Line #%d, malformed utf8 character. '
43+ 'Ignoring line.' % (line_number+1))
44+
45+ # Append each line to ignore list if it's not a comment line
46+ for line in unicode_lines:
47 line = line.rstrip('\r\n')
48 if not line or line.startswith('#'):
49 continue
50
51=== modified file 'bzrlib/tests/test_ignores.py'
52--- bzrlib/tests/test_ignores.py 2010-02-23 07:43:11 +0000
53+++ bzrlib/tests/test_ignores.py 2010-03-29 01:43:17 +0000
54@@ -50,6 +50,18 @@
55 def test_parse_empty(self):
56 ignored = ignores.parse_ignore_file(StringIO(''))
57 self.assertEqual(set([]), ignored)
58+
59+ def test_parse_non_utf8(self):
60+ """Lines with non utf 8 characters should be discarded."""
61+ ignored = ignores.parse_ignore_file(StringIO(
62+ 'utf8filename_a\n'
63+ 'invalid utf8\x80\n'
64+ 'utf8filename_b\n'
65+ ))
66+ self.assertEqual(set([
67+ 'utf8filename_a',
68+ 'utf8filename_b',
69+ ]), ignored)
70
71
72 class TestUserIgnores(TestCaseInTempDir):