Bazaar

Merge lp:~jspashett/bzr/183504_latin_2_ignore_file into lp:bzr

183504_latin_2_ignore_file
Merge into bzr.dev

Proposed by Jason Spashett on 2010-03-29

Status: Merged
Approved by: Martin Pool on 2010-03-30
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~jspashett/bzr/183504_latin_2_ignore_file
Merge into: lp:bzr
Diff against target: 72 lines (+40/-2), 2 files modified
  bzrlib/ignores.py (+28/-2)
  bzrlib/tests/test_ignores.py (+12/-0)

To merge this branch:

bzr merge lp:~jspashett/bzr/183504_latin_2_ignore_file

Low

Fix Released


Reviewer	Date Requested	Status
Martin Pool		Approve on 2010-03-30
John A Meinel	2010-03-29	Approve on 2010-03-29
Review via email: mp+22345@code.launchpad.net

This proposal supersedes a proposal from 2009-10-02.


Jason Spashett (jspashett) wrote on 2009-10-02: Posted in a previous version of this proposal

Fix for 183504: invalid UTF-8 in .bzrignore causes a stack trace.

How:
parse_ignore_file in ignores.py now UTF-8 decodes each line individually after splitting on '\n', keeping track of the line count. A UnicodeDecodeError is caught and a trace message is emitted that reports the offending line number. Processing then continues with the remaining lines.
The warning output is:

".bzrignore: On Line %d, malformed utf8 character. Ignoring line."

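As a rough illustration of the per-line approach described above, here is a Python 3 sketch. It is not the code from this branch: the name `parse_ignore_bytes` is hypothetical, it takes the raw bytes of the file rather than a file object, and the standard `warnings` module stands in for bzrlib's trace output.

```python
import warnings


def parse_ignore_bytes(data):
    """Decode each line separately, skipping lines that are not valid UTF-8.

    Hypothetical stand-in for parse_ignore_file (assumption: input is the
    raw bytes of the ignore file).
    """
    ignored = set()
    # Line numbers are 1-based so the warning matches what an editor shows.
    for line_number, raw_line in enumerate(data.split(b'\n'), start=1):
        try:
            line = raw_line.decode('utf-8')
        except UnicodeDecodeError:
            warnings.warn(
                '.bzrignore: On Line %d, malformed utf8 character. '
                'Ignoring line.' % line_number)
            continue
        line = line.rstrip('\r\n')
        # Skip blank lines and comment lines.
        if not line or line.startswith('#'):
            continue
        ignored.add(line)
    return ignored
```

Every line is decoded on its own here, even when the whole file is valid.
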

John A Meinel (jameinel) wrote on 2009-10-06: Posted in a previous version of this proposal

1) this is really something that would be good to have a test for.
We have tests in "bzrlib/tests/test_ignores.py"

2) Splitting into line-by-line is probably a good way to also handle stuff like bad regexes, etc.
Not something you need to do here, just thinking out loud.

3) I'm slightly concerned about the overhead of calling .decode('utf-8') 100 times, rather than
just 1 time for the common case that the file is all correctly written. (Certainly having a bad
line should be an exceptional case.) Perhaps something like:

    try:
        unicode_lines = content.decode('utf8').splitlines()
    except UnicodeDecodeError:
        lines = content.splitlines()
        unicode_lines = []
        for idx, line in enumerate(lines):
            try:
                unicode_lines.append(line.decode('utf-8'))
            except UnicodeDecodeError:
                # report error about line (idx+1)
    ...

So adding tests is certainly a requirement. Refactoring to make the common case fast would be nice.

It also makes me wonder if we'd want to change something to return the bogus lines (it would make writing tests easier, which usually points to better apis in the real-world...) Think about it. But just adding a test that we get the ignores we expect when there are non-utf8 characters would be ok.
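
One way to picture the "return the bogus lines" idea is a variant that hands back the undecodable line numbers alongside the patterns, so a test can assert on both. This is only a Python 3 sketch: `parse_ignore_with_errors` and its return shape are hypothetical, not an API that exists in bzrlib.

```python
def parse_ignore_with_errors(data):
    """Return (ignored, bad_line_numbers) for the raw bytes of an ignore file.

    Hypothetical helper: the caller decides how to report the bad lines.
    """
    ignored = set()
    bad_line_numbers = []
    for line_number, raw_line in enumerate(data.split(b'\n'), start=1):
        try:
            line = raw_line.decode('utf-8')
        except UnicodeDecodeError:
            # Record the 1-based line number instead of warning directly.
            bad_line_numbers.append(line_number)
            continue
        line = line.rstrip('\r\n')
        if line and not line.startswith('#'):
            ignored.add(line)
    return ignored, bad_line_numbers


ignored, bad = parse_ignore_with_errors(
    b'utf8filename_a\ninvalid utf8\x80\nutf8filename_b\n')
```

With this shape a test could check both the surviving patterns and which lines were rejected, without having to capture warning output.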

review: Needs Fixing


Jason Spashett (jspashett) wrote on 2010-03-29:

As suggested by John Meinel's review:

* Add unit test

* Parse the ignore file in one pass; if there is a Unicode decode error, fall back to parsing line by line, collecting all valid lines and warning about each line with invalid UTF-8.

revs 5123-5124
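
In outline, the reworked parsing decodes the whole file once and only falls back to per-line decoding when that fails. A Python 3 sketch of that shape follows. `decode_ignore_lines` is a hypothetical helper name, and the `warn` parameter (defaulting to `print`) stands in for bzrlib's warning call. The change itself is in the preview diff.

```python
def decode_ignore_lines(data, warn=print):
    """Return the decodable lines of an ignore file given as bytes."""
    try:
        # Common case: the whole file is valid UTF-8, so decode it once.
        return data.decode('utf-8').split('\n')
    except UnicodeDecodeError:
        pass
    # Rare case: decode line by line and keep only the good lines.
    unicode_lines = []
    for line_number, raw_line in enumerate(data.split(b'\n'), start=1):
        try:
            unicode_lines.append(raw_line.decode('utf-8'))
        except UnicodeDecodeError:
            warn('.bzrignore: On Line #%d, malformed utf8 character. '
                 'Ignoring line.' % line_number)
    return unicode_lines
```

The per-line cost is paid only for files that already failed the single whole-file decode.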


John A Meinel (jameinel) wrote on 2010-03-29:

Jason Spashett wrote:
> Jason Spashett has proposed merging lp:~jspashett/bzr/183504_latin_2_ignore_file into lp:bzr.
>
> Requested reviews:
> John A Meinel (jameinel)
> Related bugs:
> #183504 'bzr status' crash if .bzrignore containts Latin-2 chars
> https://bugs.launchpad.net/bugs/183504
>
>

review: approve

Looks good to me.

John
=:->

review: Approve


Martin Pool (mbp) wrote on 2010-03-30:

I'll add news and merge it.

review: Approve

Preview Diff



=== modified file 'bzrlib/ignores.py'
--- bzrlib/ignores.py	2009-03-23 14:59:43 +0000
+++ bzrlib/ignores.py	2010-03-29 01:43:17 +0000
@@ -25,6 +25,8 @@
     globbing,
     )
+from trace import warning
+
 # This was the full ignore list for bzr 0.8
 # please keep these sorted (in C locale order) to aid merging
 OLD_DEFAULTS = [
@@ -100,10 +102,34 @@
 ]
+
 def parse_ignore_file(f):
-    """Read in all of the lines in the file and turn it into an ignore list"""
+    """Read in all of the lines in the file and turn it into an ignore list
+
+    Continue in the case of utf8 decoding errors, and emit a warning when
+    such and error is found. Optimise for the common case -- no decoding
+    errors.
+    """
     ignored = set()
-    for line in f.read().decode('utf8').split('\n'):
+    ignore_file = f.read()
+    try:
+        # Try and parse whole ignore file at once.
+        unicode_lines = ignore_file.decode('utf8').split('\n')
+    except UnicodeDecodeError:
+        # Otherwise go though line by line and pick out the 'good'
+        # decodable lines
+        lines = ignore_file.split('\n')
+        unicode_lines = []
+        for line_number, line in enumerate(lines):
+            try:
+                unicode_lines.append(line.decode('utf-8'))
+            except UnicodeDecodeError:
+                # report error about line (idx+1)
+                warning('.bzrignore: On Line #%d, malformed utf8 character. '
+                        'Ignoring line.' % (line_number+1))
+
+    # Append each line to ignore list if it's not a comment line
+    for line in unicode_lines:
         line = line.rstrip('\r\n')
         if not line or line.startswith('#'):
             continue
=== modified file 'bzrlib/tests/test_ignores.py'
--- bzrlib/tests/test_ignores.py	2010-02-23 07:43:11 +0000
+++ bzrlib/tests/test_ignores.py	2010-03-29 01:43:17 +0000
@@ -50,6 +50,18 @@
     def test_parse_empty(self):
         ignored = ignores.parse_ignore_file(StringIO(''))
         self.assertEqual(set([]), ignored)
+
+    def test_parse_non_utf8(self):
+        """Lines with non utf 8 characters should be discarded."""
+        ignored = ignores.parse_ignore_file(StringIO(
+                'utf8filename_a\n'
+                'invalid utf8\x80\n'
+                'utf8filename_b\n'
+                ))
+        self.assertEqual(set([
+                        'utf8filename_a',
+                        'utf8filename_b',
+                       ]), ignored)
 class TestUserIgnores(TestCaseInTempDir):
