Merge lp:~doxxx/bzr/572092-ignore-dupes into lp:bzr

Proposed by Gordon Tyler on 2010-05-01
Status: Merged
Approved by: Parth Malwankar on 2010-05-04
Approved revision: 5203
Merged at revision: 5207
Proposed branch: lp:~doxxx/bzr/572092-ignore-dupes
Merge into: lp:bzr
Diff against target: 151 lines (+69/-20)
3 files modified
NEWS (+3/-0)
bzrlib/ignores.py (+31/-13)
bzrlib/tests/test_ignores.py (+35/-7)
To merge this branch: bzr merge lp:~doxxx/bzr/572092-ignore-dupes
Reviewer Review Type Date Requested Status
Parth Malwankar 2010-05-01 Approve on 2010-05-04
Alexander Belchenko Approve on 2010-05-04
Review via email: mp+24543@code.launchpad.net

Commit Message

(doxxx) 'ignore' command avoids putting duplicate entry into .bzrignore

Description of the Change

Fixes bug 572092 by changing tree_ignores_add_patterns to check for duplicate patterns in the existing .bzrignore file.

To post a comment you must log in.
Alexander Belchenko (bialix) wrote :

Perhaps, reading the file twice from the disk is not best idea, but we should hit disk cache easily. And this operation is not performance critical one.

Thanks!

review: Approve
Gordon Tyler (doxxx) wrote :

On 5/1/2010 2:12 AM, Alexander Belchenko wrote:
> Perhaps, reading the file twice from the disk is not best idea, but
> we should hit disk cache easily. And this operation is not
> performance critical one.

My initial implementation just read it into the set, updated the set
with the new patterns and then wrote the set back out to disk. The
problem with that was that it changed the order of the patterns in the file.

I could probably read the file once into a StringIO object and then
parse from that...

Parth Malwankar (parthm) wrote :

> I could probably read the file once into a StringIO object and then
> parse from that...

StringIO sounds like a good idea to me, in the edge case that
the file gets modified between reads the buffer will be
consistent for the program.

John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Parth Malwankar wrote:
>> I could probably read the file once into a StringIO object and then
>> parse from that...
>
> StringIO sounds like a good idea to me, in the edge case that
> the file gets modified between reads the buffer will be
> consistent for the program.

Why not read into a list with a set check:

patterns = []
pattern_set = set()

for line in disk_file:
  ignore = XXXX(line)
  if ignore not in pattern_set:
    pattern_set.add(ignore)
    patterns.append(ignore)

return patterns

You get filtering and still retain the ordering. Also note that you can
have comments in the ignore file, which we would also like to preserve.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvce8kACgkQJdeBCYSNAAMnUwCgqWg1ZSg9E6Fi/DChM6lZbsrL
vDQAoMIJXrt0sbT/YKOCRRjbEoBnYwYZ
=pz71
-----END PGP SIGNATURE-----

Gordon Tyler (doxxx) wrote :

> Why not read into a list with a set check:
>
> patterns = []
> pattern_set = set()
>
> for line in disk_file:
> ignore = XXXX(line)
> if ignore not in pattern_set:
> pattern_set.add(ignore)
> patterns.append(ignore)
>
> return patterns

I wanted to reuse the parse_ignore_file so that I didn't have to duplicate all the reading and error handling code.

> You get filtering and still retain the ordering. Also note that you can
> have comments in the ignore file, which we would also like to preserve.

Yeah, that was my intention.

Parth Malwankar (parthm) wrote :

I think assertPatternEquals needs to use sorted list instead
of set for tests to catch duplication error. Apart from that
this looks good to me.

    def assertPatternsEquals(self, patterns):
        self.assertEquals(sorted(patterns),
            sorted(open(".bzrignore", 'r').read().strip().split('\n')))

review: Needs Fixing
Gordon Tyler (doxxx) wrote :

Good point and done.

Parth Malwankar (parthm) wrote :

Looks good to me.

review: Approve
Alexander Belchenko (bialix) wrote :

Usually we importing StringIO from c-module cStringIO

from cStringIO import StringIO

Is there any special reason why you prefer python implementation?

Gordon Tyler (doxxx) wrote :

On 5/2/2010 3:28 PM, Alexander Belchenko wrote:
> Usually we importing StringIO from c-module cStringIO
>
> from cStringIO import StringIO
>
> Is there any special reason why you prefer python implementation?

This isn't a particular performance-sensitive situation. I figured
StringIO would be okay. I'm assuming there's a cost to import cStringIO?
Otherwise why would the python implementation exist at all?

Andrew Bennetts (spiv) wrote :

Gordon Tyler wrote:
> On 5/2/2010 3:28 PM, Alexander Belchenko wrote:
> > Usually we importing StringIO from c-module cStringIO
[...]
>
> This isn't a particular performance-sensitive situation. I figured
> StringIO would be okay. I'm assuming there's a cost to import cStringIO?
> Otherwise why would the python implementation exist at all?

Because it's a very old module in the standard library, and people know
better now :)

FWIW, in Python 3, there's no cStringIO any more. There's just
io.StringIO (and io.BytesIO), and they are automatically C
implementations.

Gordon Tyler (doxxx) wrote :

Well, alrighty then. :) cStringIO it is.

Parth Malwankar (parthm) wrote :

> Usually we importing StringIO from c-module cStringIO
>
> from cStringIO import StringIO
>
> Is there any special reason why you prefer python implementation?

I don't know the impact in this specific situation but
IIUC StringIO works with unicode cStringIO does not.

From the cStringIO docs[1]:
> Unlike the memory files implemented by the StringIO
> module, those provided by this module are not able
> to accept Unicode strings that cannot be encoded
> as plain ASCII strings.

[1] http://docs.python.org/library/stringio.html#module-cStringIO

John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gordon Tyler wrote:
> Well, alrighty then. :) cStringIO it is.

Well that and because doing stuff like:

cStringIO.StringIO(u'unicode')

will give you a pointer to the *raw* unicode bytes. (So on UCS2
versions, you get '\0u\0n\0i\0c\0o\0d\0e'. And on UCS4 you get
'\0\0\0u\0\0\0n\0\0\0i...')

However, the really good bit is that cStringIO forces it to be in 8-bit
encoding, which is useful when you want to make sure you aren't
switching between 8-bit strings and Unicode strings accidentally.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkveO1EACgkQJdeBCYSNAAP/tQCgmdZqSFM2m2ANRuyzQGT8a0k8
ke4AmQE4OT/78hejuZfBo4DROcKdxuZ/
=oCIc
-----END PGP SIGNATURE-----

Gordon Tyler (doxxx) wrote :

On 5/2/2010 10:57 PM, John A Meinel wrote:
> Well that and because doing stuff like:
>
> cStringIO.StringIO(u'unicode')
>
> will give you a pointer to the *raw* unicode bytes. (So on UCS2
> versions, you get '\0u\0n\0i\0c\0o\0d\0e'. And on UCS4 you get
> '\0\0\0u\0\0\0n\0\0\0i...')
>
> However, the really good bit is that cStringIO forces it to be in 8-bit
> encoding, which is useful when you want to make sure you aren't
> switching between 8-bit strings and Unicode strings accidentally.

The new patterns, which may be Unicode strings, are encoded to UTF-8
before being written out to the file.

However, I added some more tests to see how it deals with non-Unix line
endings in the .bzrignore file and it isn't preserving them properly
when writing out the new contents. So back to the drawing board for me.

Gordon Tyler (doxxx) wrote :

Once Launchpad decides to finish processing my latest push, I believe I have solved the line endings problem and added tests for both that and Unicode handling.

Gordon Tyler (doxxx) wrote :

Hmmm... Launchpad is still processing my last push after 6 hours. Could somebody poke it with a sharp stick?

Alexander Belchenko (bialix) wrote :

Gordon Tyler пишет:
> Hmmm... Launchpad is still processing my last push after 6 hours. Could somebody poke it with a sharp stick?

http://twitter.com/launchpadstatus/status/13298943124

Gordon Tyler (doxxx) wrote :

Okay, looks like it's updated. Give it an eyeball or two?

Alexander Belchenko (bialix) wrote :

Fine for me.

review: Approve
Parth Malwankar (parthm) wrote :

Looks good to me.

review: Approve
Robert Collins (lifeless) wrote :

Launchpad will be slow for another day - its processing all of
maverick (the release after lucid) at the moment. Just a heads up.

Parth Malwankar (parthm) wrote :

> Launchpad will be slow for another day - its processing all of
> maverick (the release after lucid) at the moment. Just a heads up.

Thanks for the heads up Robert.
I will mark this as approved and ready to land, and try to land it tomorrow (once branch refresh is more responsive) in case no one has any further points to discuss on this mp.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2010-05-03 04:08:50 +0000
3+++ NEWS 2010-05-04 00:05:38 +0000
4@@ -63,6 +63,9 @@
5 more comprehensible.
6 (Martin Pool, #491763)
7
8+* ``bzr ignore`` will no longer add duplicate patterns to .bzrignore.
9+ (Gordon Tyler, #572092)
10+
11 Improvements
12 ************
13
14
15=== modified file 'bzrlib/ignores.py'
16--- bzrlib/ignores.py 2010-04-30 07:54:33 +0000
17+++ bzrlib/ignores.py 2010-05-04 00:05:38 +0000
18@@ -17,6 +17,8 @@
19 """Lists of ignore files, etc."""
20
21 import errno
22+import os
23+from cStringIO import StringIO
24
25 import bzrlib
26 from bzrlib import (
27@@ -181,28 +183,44 @@
28 The ignore file will be automatically added under version control.
29
30 :param tree: Working tree to update the ignore list.
31+ :param name_pattern_list: List of ignore patterns.
32+ :return: None
33 """
34+ # read in the existing ignores set
35 ifn = tree.abspath(bzrlib.IGNORE_FILENAME)
36 if tree.has_filename(ifn):
37- f = open(ifn, 'rt')
38+ f = open(ifn, 'rU')
39 try:
40- igns = f.read().decode('utf-8')
41+ file_contents = f.read()
42+ # figure out what kind of line endings are used
43+ newline = getattr(f, 'newlines', None)
44+ if type(newline) is tuple:
45+ newline = newline[0]
46+ elif newline is None:
47+ newline = os.linesep
48 finally:
49 f.close()
50 else:
51- igns = ""
52-
53- # TODO: If the file already uses crlf-style termination, maybe
54- # we should use that for the newly added lines?
55-
56- if igns and igns[-1] != '\n':
57- igns += '\n'
58- for name_pattern in name_pattern_list:
59- igns += name_pattern + '\n'
60-
61+ file_contents = ""
62+ newline = os.linesep
63+
64+ sio = StringIO(file_contents)
65+ try:
66+ ignores = parse_ignore_file(sio)
67+ finally:
68+ sio.close()
69+
70+ # write out the updated ignores set
71 f = atomicfile.AtomicFile(ifn, 'wb')
72 try:
73- f.write(igns.encode('utf-8'))
74+ # write the original contents, preserving original line endings
75+ f.write(newline.join(file_contents.split('\n')))
76+ if len(file_contents) > 0 and not file_contents.endswith('\n'):
77+ f.write(newline)
78+ for pattern in name_pattern_list:
79+ if not pattern in ignores:
80+ f.write(pattern.encode('utf-8'))
81+ f.write(newline)
82 f.commit()
83 finally:
84 f.close()
85
86=== modified file 'bzrlib/tests/test_ignores.py'
87--- bzrlib/tests/test_ignores.py 2009-12-26 21:31:56 +0000
88+++ bzrlib/tests/test_ignores.py 2010-05-04 00:05:38 +0000
89@@ -165,27 +165,55 @@
90
91
92 class TestTreeIgnores(TestCaseWithTransport):
93+
94+ def assertPatternsEquals(self, patterns):
95+ contents = open(".bzrignore", 'rU').read().strip().split('\n')
96+ self.assertEquals(sorted(patterns), sorted(contents))
97
98 def test_new_file(self):
99 tree = self.make_branch_and_tree(".")
100 ignores.tree_ignores_add_patterns(tree, ["myentry"])
101 self.assertTrue(tree.has_filename(".bzrignore"))
102- self.assertEquals("myentry\n",
103- open(".bzrignore", 'r').read())
104+ self.assertPatternsEquals(["myentry"])
105
106 def test_add_to_existing(self):
107 tree = self.make_branch_and_tree(".")
108 self.build_tree_contents([('.bzrignore', "myentry1\n")])
109 tree.add([".bzrignore"])
110 ignores.tree_ignores_add_patterns(tree, ["myentry2", "foo"])
111- self.assertEquals("myentry1\nmyentry2\nfoo\n",
112- open(".bzrignore", 'r').read())
113+ self.assertPatternsEquals(["myentry1", "myentry2", "foo"])
114
115 def test_adds_ending_newline(self):
116 tree = self.make_branch_and_tree(".")
117 self.build_tree_contents([('.bzrignore', "myentry1")])
118 tree.add([".bzrignore"])
119 ignores.tree_ignores_add_patterns(tree, ["myentry2"])
120- self.assertEquals("myentry1\nmyentry2\n",
121- open(".bzrignore", 'r').read())
122-
123+ self.assertPatternsEquals(["myentry1", "myentry2"])
124+ text = open(".bzrignore", 'r').read()
125+ self.assertTrue(text.endswith('\r\n') or
126+ text.endswith('\n') or
127+ text.endswith('\r'))
128+
129+ def test_does_not_add_dupe(self):
130+ tree = self.make_branch_and_tree(".")
131+ self.build_tree_contents([('.bzrignore', "myentry\n")])
132+ tree.add([".bzrignore"])
133+ ignores.tree_ignores_add_patterns(tree, ["myentry"])
134+ self.assertPatternsEquals(["myentry"])
135+
136+ def test_non_ascii(self):
137+ tree = self.make_branch_and_tree(".")
138+ self.build_tree_contents([('.bzrignore',
139+ u"myentry\u1234\n".encode('utf-8'))])
140+ tree.add([".bzrignore"])
141+ ignores.tree_ignores_add_patterns(tree, [u"myentry\u5678"])
142+ self.assertPatternsEquals([u"myentry\u1234".encode('utf-8'),
143+ u"myentry\u5678".encode('utf-8')])
144+
145+ def test_crlf(self):
146+ tree = self.make_branch_and_tree(".")
147+ self.build_tree_contents([('.bzrignore', "myentry1\r\n")])
148+ tree.add([".bzrignore"])
149+ ignores.tree_ignores_add_patterns(tree, ["myentry2", "foo"])
150+ self.assertEquals(open('.bzrignore', 'rb').read(), 'myentry1\r\nmyentry2\r\nfoo\r\n')
151+ self.assertPatternsEquals(["myentry1", "myentry2", "foo"])