Merge lp:~songofacandy/bzr/fix-523746-2 into lp:bzr/2.0

Proposed by methane
Status: Rejected
Rejected by: Vincent Ladeuil
Proposed branch: lp:~songofacandy/bzr/fix-523746-2
Merge into: lp:bzr/2.0
Diff against target: 52 lines (+23/-5)
1 file modified
bzrlib/diff.py (+23/-5)
To merge this branch: bzr merge lp:~songofacandy/bzr/fix-523746-2
Reviewer Review Type Date Requested Status
Vincent Ladeuil Needs Information
Martin Packman (community) Needs Information
Review via email: mp+19938@code.launchpad.net
To post a comment you must log in.
Revision history for this message
methane (songofacandy) wrote :

Make sure that temporary file name is ASCII.

Revision history for this message
Martin Packman (gz) wrote :

I think the basic idea is sound here, just a couple of questions.

Are the mangled filenames going to be in the output of the external tool? If so there needs to be some unmangling code.

Can relpath be a string like u"docs/マニュアル/index.html"? If so we risk losing path components, should use 'replace' for encoding errors and then do .replace("?", "_") to make it windows-happy.

Could use `sys.getfilesystemencoding() or 'ascii'` rather than just 'ascii' as the encoding. Likewise could use `(path) or 'FILE'` rather than `'_'+(path)` to avoid the empty string.

review: Needs Information
Revision history for this message
methane (songofacandy) wrote :

> Are the mangled filenames going to be in the output of the external tool? If
> so there needs to be some unmangling code.

These filenames are just an input of the external diff tool.

> Can relpath be a string like u"docs/マニュアル/index.html"? If so we risk losing
> path components, should use 'replace' for encoding errors and then do
> .replace("?", "_") to make it windows-happy.

You're right! Adding "_" prefix is not enough.

> Could use `sys.getfilesystemencoding() or 'ascii'` rather than just 'ascii' as
> the encoding. Likewise could use `(path) or 'FILE'` rather than `'_'+(path)`
> to avoid the empty string.

I've made it.
Please review again.

Revision history for this message
Martin Packman (gz) wrote :

Ack, was that big a follow up needed? I only intended to suggest something like:

- full_path = osutils.pathjoin(self._root, prefix, relpath)
+ full_path = osutils.pathjoin(self._root, prefix, relpath.encode(
+ sys.getfilesystemencoding() or 'ascii', 'replace').replace(
+ "?", "_"))

The new encoding-then-decoding seems redundant, as do the changes in `_get_command`, what prompted you to add that when the last version was so simple?

Revision history for this message
methane (songofacandy) wrote :

> Ack, was that big a follow up needed? I only intended to suggest something
> like:
>
> - full_path = osutils.pathjoin(self._root, prefix, relpath)
> + full_path = osutils.pathjoin(self._root, prefix, relpath.encode(
> + sys.getfilesystemencoding() or 'ascii', 'replace').replace(
> + "?", "_"))
>
> The new encoding-then-decoding seems redundant, as do the changes in
> `_get_command`, what prompted you to add that when the last version was so
> simple?

Multibyte encoding may contains '?' as a trailing char.
So ``full_path.encode('mbcs', 'replace').replace('?', '_')`` may break
multibyte char.

And I hate self._root or prefix is changed through encode().decode() and bzr
attempts to write outside of _root.
So I do:
1) Write temporary file with safe unicode path.
2) Launch external diff with encoded path.

Revision history for this message
Martin Packman (gz) wrote :

> Multibyte encoding may contains '?' as a trailing char.
> So ``full_path.encode('mbcs', 'replace').replace('?', '_')`` may break
> multibyte char.

Really? Even SJIS avoids that one, and it's pretty good at breaking naïve programs. I get the reasoning now though.

> And I hate self._root or prefix is changed through encode().decode() and bzr
> attempts to write outside of _root.
> So I do:
> 1) Write temporary file with safe unicode path.
> 2) Launch external diff with encoded path.

Hm, I see. Probably overly cautious though, the windows filesystem layer will just be decoding by the system locale internally anyway.

Revision history for this message
methane (songofacandy) wrote :

> > Multibyte encoding may contains '?' as a trailing char.
> > So ``full_path.encode('mbcs', 'replace').replace('?', '_')`` may break
> > multibyte char.
>
> Really? Even SJIS avoids that one, and it's pretty good at breaking naïve
> programs. I get the reasoning now though.

I've checked SJIS and BIG5 a little while ago, and find '?' (0x3f) is not in trailing char.
So replace('?', '_') may be safe.

But I doesn't like depending character mapping.
I think encoded string is magic bytes and modifying it is dangerous.

Revision history for this message
Martin Packman (gz) wrote :

> But I doesn't like depending character mapping.
> I think encoded string is magic bytes and modifying it is dangerous.

Yeah, fair enough. Perhaps dealing with this just for tempfiles is too much and sticking to ascii-only for the mangling is the best choice.

Revision history for this message
methane (songofacandy) wrote :

> Perhaps dealing with this just for tempfiles is too much
> and sticking to ascii-only for the mangling is the best choice.

So I use ascii before.
But I find filename of tempfile is shown on GUI Diff title.
"old/????/????.txt - new/????/????.txt" isn't cool.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin [gz] wrote:
> Ack, was that big a follow up needed? I only intended to suggest something like:
>
> - full_path = osutils.pathjoin(self._root, prefix, relpath)
> + full_path = osutils.pathjoin(self._root, prefix, relpath.encode(
> + sys.getfilesystemencoding() or 'ascii', 'replace').replace(
> + "?", "_"))

^- This should probably be 'osutils._fs_enc' where we already handle fs
encoding of None.

>
> The new encoding-then-decoding seems redundant, as do the changes in `_get_command`, what prompted you to add that when the last version was so simple?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuH3lsACgkQJdeBCYSNAANLJACgvw/A8y0EKW4chN31ewdmof2r
oqoAn3NAsSjyZ7vcvcdL6sOLKEjgg2Tq
=eWJM
-----END PGP SIGNATURE-----

Revision history for this message
methane (songofacandy) wrote :

> Martin [gz] wrote:
> > Ack, was that big a follow up needed? I only intended to suggest something
> like:
> >
> > - full_path = osutils.pathjoin(self._root, prefix, relpath)
> > + full_path = osutils.pathjoin(self._root, prefix, relpath.encode(
> > + sys.getfilesystemencoding() or 'ascii', 'replace').replace(
> > + "?", "_"))
>
> ^- This should probably be 'osutils._fs_enc' where we already handle fs
> encoding of None.

subprocess.Popen uses os.execv(), and os.execv() use Py_FileSystemDefaultEncoding (=sys.getfilesystemencoding()) to encode arguments. When filesystemencoding is NULL, fallbacks to defaultencoding.

osutils._fs_enc fallbacks to 'utf-8'. So using osutils._fs_enc may cause UnicodeEncodeError on Popen.
# I think osutils._fs_enc fallbacks to utf-8 is good because I hope bzr uses
# utf-8 when I set LANG=C.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

INADA Naoki wrote:
>> Martin [gz] wrote:
>>> Ack, was that big a follow up needed? I only intended to suggest something
>> like:
>>> - full_path = osutils.pathjoin(self._root, prefix, relpath)
>>> + full_path = osutils.pathjoin(self._root, prefix, relpath.encode(
>>> + sys.getfilesystemencoding() or 'ascii', 'replace').replace(
>>> + "?", "_"))
>> ^- This should probably be 'osutils._fs_enc' where we already handle fs
>> encoding of None.
>
> subprocess.Popen uses os.execv(), and os.execv() use Py_FileSystemDefaultEncoding (=sys.getfilesystemencoding()) to encode arguments. When filesystemencoding is NULL, fallbacks to defaultencoding.
>
> osutils._fs_enc fallbacks to 'utf-8'. So using osutils._fs_enc may cause UnicodeEncodeError on Popen.
> # I think osutils._fs_enc fallbacks to utf-8 is good because I hope bzr uses
> # utf-8 when I set LANG=C.

If you set LANG=C then we see fs_enc as ascii, not None.

The change was primarily for BSD, where fs_enc was None, and we wanted
UTF-8 as the most reasonable option.

However, if execv is going to fail if we pass something other than
ascii, then fair enough.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuH/RkACgkQJdeBCYSNAAPZdACgo0Dy748l7zTEzTElE5Gj8tL5
ZnMAoJO+qwFcz1H92xlmlDNt2VoskBII
=W/Bm
-----END PGP SIGNATURE-----

Revision history for this message
Vincent Ladeuil (vila) wrote :

This sounds far too controversial for landing in 2.0 (especially without tests), can we have a summary from Naoki and reviewers about landing this patch into bzr.dev instead ?

review: Needs Information
Revision history for this message
methane (songofacandy) wrote :

This problem depends on OS, locale settings and other environment.
So I agree with you that this patch should land to bzr.dev and be tested on
vary environment while bzr 2.2 is released.

Revision history for this message
methane (songofacandy) wrote :

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/diff.py'
2--- bzrlib/diff.py 2009-07-29 21:35:05 +0000
3+++ bzrlib/diff.py 2010-02-26 11:17:14 +0000
4@@ -682,7 +682,17 @@
5
6 def _get_command(self, old_path, new_path):
7 my_map = {'old_path': old_path, 'new_path': new_path}
8- return [t % my_map for t in self.command_template]
9+ command = [t % my_map for t in self.command_template]
10+ if sys.platform == 'win32': # Popen doesn't accept unicode on win32
11+ command_encoded = []
12+ for c in command:
13+ if isinstance(c, unicode):
14+ command_encoded.append(c.encode('mbcs'))
15+ else:
16+ command_encoded.append(c)
17+ return command_encoded
18+ else:
19+ return command
20
21 def _execute(self, old_path, new_path):
22 command = self._get_command(old_path, new_path)
23@@ -709,7 +719,15 @@
24 return True
25
26 def _write_file(self, file_id, tree, prefix, relpath):
27- full_path = osutils.pathjoin(self._root, prefix, relpath)
28+ if sys.platform == 'win32':
29+ fenc = 'mbcs'
30+ else:
31+ fenc = sys.getfilesystemencoding()
32+ # encoded_str.replace('?', '_') may break multibyte char.
33+ # So we should encode, decode, then replace(u'?', u'_')
34+ relpath_tmp = relpath.encode(fenc, 'replace').decode(fenc, 'replace')
35+ relpath_tmp = relpath_tmp.replace(u'?', u'_')
36+ full_path = osutils.pathjoin(self._root, prefix, relpath_tmp)
37 if self._try_symlink_root(tree, prefix):
38 return full_path
39 parent_dir = osutils.dirname(full_path)
40@@ -750,9 +768,9 @@
41 def diff(self, file_id, old_path, new_path, old_kind, new_kind):
42 if (old_kind, new_kind) != ('file', 'file'):
43 return DiffPath.CANNOT_DIFF
44- self._prepare_files(file_id, old_path, new_path)
45- self._execute(osutils.pathjoin('old', old_path),
46- osutils.pathjoin('new', new_path))
47+ old_disk_path, new_disk_path = \
48+ self._prepare_files(file_id, old_path, new_path)
49+ self._execute(old_disk_path, new_disk_path)
50
51
52 class DiffTree(object):

Subscribers

People subscribed via source and target branches