Merge lp:~gz/bzr/filesystem_default_encoding_794353 into lp:bzr

Proposed by Martin Packman
Status: Merged
Approved by: Jelmer Vernooij
Approved revision: no longer in the source branch.
Merged at revision: 6367
Proposed branch: lp:~gz/bzr/filesystem_default_encoding_794353
Merge into: lp:bzr
Diff against target: 204 lines (+71/-16)
7 files modified
bzr (+4/-0)
bzrlib/__init__.py (+42/-0)
bzrlib/osutils.py (+2/-5)
bzrlib/tests/blackbox/test_exceptions.py (+8/-0)
bzrlib/tests/test_osutils.py (+3/-11)
doc/en/release-notes/bzr-2.5.txt (+5/-0)
doc/en/whats-new/whats-new-in-2.5.txt (+7/-0)
To merge this branch: bzr merge lp:~gz/bzr/filesystem_default_encoding_794353
Reviewer Review Type Date Requested Status
Vincent Ladeuil Approve
Jelmer Vernooij (community) code Approve
Review via email: mp+85170@code.launchpad.net

Commit message

Override Py_FileSystemDefaultEncoding to utf-8 from ascii for the bzr script

Description of the change

Use dark ctypes hackery to change the encoding Python uses when interacting with the filesystem on non-OSX posix machines. This affects open() and most os module functions when passed a unicode filename. Existing locale settings are obeyed, but if an invalid locale or the C locale is given, UTF-8 will be used instead of ascii for encoding and decoding filenames and such like.

Looking through the Python 2.7 source, I believe this is a safe operation. The Py_FileSystemDefaultEncoding global is declared in Python/bltinmodule.c and initialised on posix systems in Python/pythonrun.c if a valid locale is given. It is accessed in Objects/fileobject.c for open(), in Python/import.c for imp.find_module(), in Python/sysmodule.c to return the value with sys.getfilesystemencoding(), in Modules/_io/fileio.c for _io.FileIO() which is the Python 3 io backport, and throughout Modules/posixmodule.c for all the os functions. Nothing looks like it will cause issues if the value is changed underneath. On the bzrlib side, doing this early in startup should avoid any potential problems.

Initially this is likely to just expose a lot of areas where the ui does not handle encoding fallbacks correctly, and the user encoding will still be ascii. For instance, in the test, replacing "init" with "init-repo" results in a UnicodeEncodeError. I don't think changing the user encoding default as well is necessarily the correct thing, and most issues like this are existing usability problems when running on a non-utf-8 terminal.

To post a comment you must log in.
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Do we really want to do this in bzrlib.__init__ ? This will also influence other applications that happen to use bzrlib. Changing the filesystem encoding behind their back seems wrong.

Perhaps it would make more sense to do this in the 'bzr' script?

review: Needs Information
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Garr, nevermind. Jelmer skimming fail. I will lay off the coffee and try to read slower. Sorry.

Revision history for this message
Jelmer Vernooij (jelmer) :
review: Approve (code)
Revision history for this message
Vincent Ladeuil (vila) wrote :

Based on the tests and the decrease of roundtrips: approved !

16 + def _vfs_checkout_metadir(self):
17 + self._ensure_real()
18 + return self._real_bzrdir.checkout_metadir()
19 +
20 + def checkout_metadir(self):
21 + medium = self._client._medium
22 + if medium._is_remote_before((2, 5)):
23 + return self._vfs_checkout_metadir()

_vfs_checkout_metadir -> checkout_metadir -> _vfs_checkout_metadir ?

Is that a loop or do I miss something obvious ?

Up to you to wait for spiv's review.

review: Approve
Revision history for this message
Vincent Ladeuil (vila) wrote :

Meh, wrong proposal.

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzr'
2--- bzr 2011-10-27 15:38:14 +0000
3+++ bzr 2011-12-14 17:09:40 +0000
4@@ -84,6 +84,10 @@
5 ' Although this should be no problem for bzr itself, it might\n'
6 ' cause problems with some plugins. To investigate the issue,\n'
7 ' look at the output of the locale(1p) tool.\n' % e)
8+ # Use better default than ascii with posix filesystems that deal in bytes
9+ # natively even when the C locale or no locale at all is given. Note that
10+ # we need an immortal string for the hack, hence the lack of a hyphen.
11+ sys._bzr_default_fs_enc = "utf8"
12
13
14 # The python2.6 release includes some libraries that have deprecation warnings
15
16=== modified file 'bzrlib/__init__.py'
17--- bzrlib/__init__.py 2011-12-08 09:24:06 +0000
18+++ bzrlib/__init__.py 2011-12-14 17:09:40 +0000
19@@ -37,6 +37,7 @@
20 # timestamps relative to program start in the log file kept by bzrlib.trace.
21 _start_time = time.time()
22
23+import codecs
24 import sys
25
26
27@@ -131,6 +132,47 @@
28 __version__ = _format_version_tuple(version_info)
29 version_string = __version__
30
31+
32+def _patch_filesystem_default_encoding(new_enc):
33+ """Change the Python process global encoding for filesystem names
34+
35+ The effect is to change how open() and other builtin functions handle
36+ unicode filenames on posix systems. This should only be done near startup.
37+
38+ The new encoding string passed to this function must survive until process
39+ termination, otherwise the interpreter may access uninitialized memory.
40+ The use of intern() may defer breakage is but is not enough, the string
41+ object should be secure against module reloading and during teardown.
42+ """
43+ try:
44+ import ctypes
45+ except ImportError:
46+ return
47+ pythonapi = getattr(ctypes, "pythonapi", None)
48+ if pythonapi is None:
49+ # Not CPython ctypes implementation
50+ return
51+ old_ptr = ctypes.c_void_p.in_dll(pythonapi, "Py_FileSystemDefaultEncoding")
52+ new_ptr = ctypes.cast(ctypes.c_char_p(intern(new_enc)), ctypes.c_void_p)
53+ old_ptr.value = new_ptr.value
54+ if sys.getfilesystemencoding() != new_enc:
55+ raise RuntimeError("Failed to change the filesystem default encoding")
56+ return new_enc
57+
58+
59+# When running under the bzr script, override bad filesystem default encoding.
60+# This is not safe to do for all users of bzrlib, other scripts should instead
61+# just ensure a usable locale is set via the $LANG variable on posix systems.
62+_fs_enc = sys.getfilesystemencoding()
63+if getattr(sys, "_bzr_default_fs_enc", None) is not None:
64+ if (_fs_enc is None or codecs.lookup(_fs_enc).name == "ascii"):
65+ _fs_enc = _patch_filesystem_default_encoding(sys._bzr_default_fs_enc)
66+if _fs_enc is None:
67+ _fs_enc = "ascii"
68+else:
69+ _fs_enc = codecs.lookup(_fs_enc).name
70+
71+
72 # bzr has various bits of global state that are slowly being eliminated.
73 # This variable is intended to permit any new state-like things to be attached
74 # to a library_state.BzrLibraryState object rather than getting new global
75
76=== modified file 'bzrlib/osutils.py'
77--- bzrlib/osutils.py 2011-12-05 14:21:55 +0000
78+++ bzrlib/osutils.py 2011-12-14 17:09:40 +0000
79@@ -63,7 +63,7 @@
80
81
82 import bzrlib
83-from bzrlib import symbol_versioning
84+from bzrlib import symbol_versioning, _fs_enc
85
86
87 # Cross platform wall-clock time functionality with decent resolution.
88@@ -293,7 +293,6 @@
89 # choke on a Unicode string containing a relative path if
90 # os.getcwd() returns a non-sys.getdefaultencoding()-encoded
91 # string.
92-_fs_enc = sys.getfilesystemencoding() or 'utf-8'
93 def _posix_abspath(path):
94 # jam 20060426 rather than encoding to fsencoding
95 # copy posixpath.abspath, but use os.getcwdu instead
96@@ -1771,7 +1770,6 @@
97 """
98 global _selected_dir_reader
99 if _selected_dir_reader is None:
100- fs_encoding = _fs_enc.upper()
101 if sys.platform == "win32" and win32utils.winver == 'Windows NT':
102 # Win98 doesn't have unicode apis like FindFirstFileW
103 # TODO: We possibly could support Win98 by falling back to the
104@@ -1783,8 +1781,7 @@
105 _selected_dir_reader = Win32ReadDir()
106 except ImportError:
107 pass
108- elif fs_encoding in ('UTF-8', 'US-ASCII', 'ANSI_X3.4-1968'):
109- # ANSI_X3.4-1968 is a form of ASCII
110+ elif _fs_enc in ('utf-8', 'ascii'):
111 try:
112 from bzrlib._readdir_pyx import UTF8DirReader
113 _selected_dir_reader = UTF8DirReader()
114
115=== modified file 'bzrlib/tests/blackbox/test_exceptions.py'
116--- bzrlib/tests/blackbox/test_exceptions.py 2011-09-16 20:42:31 +0000
117+++ bzrlib/tests/blackbox/test_exceptions.py 2011-12-14 17:09:40 +0000
118@@ -60,6 +60,14 @@
119 flags=re.MULTILINE)
120 self.assertEquals(out, "")
121
122+ def test_utf8_default_fs_enc(self):
123+ """In the C locale bzr treats a posix filesystem as UTF-8 encoded"""
124+ if os.name != "posix":
125+ raise tests.TestNotApplicable("Needs system beholden to C locales")
126+ out, err = self.run_bzr_subprocess(["init", "file:%C2%A7"],
127+ env_changes={"LANG": "C", "LC_ALL": "C"})
128+ self.assertContainsRe(out, "^Created a standalone tree .*$")
129+
130
131 class TestOptParseBugHandling(tests.TestCase):
132 "Test that we handle http://bugs.python.org/issue2931"
133
134=== modified file 'bzrlib/tests/test_osutils.py'
135--- bzrlib/tests/test_osutils.py 2011-12-02 12:49:48 +0000
136+++ bzrlib/tests/test_osutils.py 2011-12-14 17:09:40 +0000
137@@ -1217,7 +1217,7 @@
138 self.requireFeature(UTF8DirReaderFeature)
139 self._save_platform_info()
140 win32utils.winver = None # Avoid the win32 detection code
141- osutils._fs_enc = 'UTF-8'
142+ osutils._fs_enc = 'utf-8'
143 self.assertDirReaderIs(
144 UTF8DirReaderFeature.module.UTF8DirReader)
145
146@@ -1225,22 +1225,14 @@
147 self.requireFeature(UTF8DirReaderFeature)
148 self._save_platform_info()
149 win32utils.winver = None # Avoid the win32 detection code
150- osutils._fs_enc = 'US-ASCII'
151- self.assertDirReaderIs(
152- UTF8DirReaderFeature.module.UTF8DirReader)
153-
154- def test_force_walkdirs_utf8_fs_ANSI(self):
155- self.requireFeature(UTF8DirReaderFeature)
156- self._save_platform_info()
157- win32utils.winver = None # Avoid the win32 detection code
158- osutils._fs_enc = 'ANSI_X3.4-1968'
159+ osutils._fs_enc = 'ascii'
160 self.assertDirReaderIs(
161 UTF8DirReaderFeature.module.UTF8DirReader)
162
163 def test_force_walkdirs_utf8_fs_latin1(self):
164 self._save_platform_info()
165 win32utils.winver = None # Avoid the win32 detection code
166- osutils._fs_enc = 'latin1'
167+ osutils._fs_enc = 'iso-8859-1'
168 self.assertDirReaderIs(osutils.UnicodeDirReader)
169
170 def test_force_walkdirs_utf8_nt(self):
171
172=== modified file 'doc/en/release-notes/bzr-2.5.txt'
173--- doc/en/release-notes/bzr-2.5.txt 2011-12-14 12:27:44 +0000
174+++ doc/en/release-notes/bzr-2.5.txt 2011-12-14 17:09:40 +0000
175@@ -36,6 +36,11 @@
176
177 * New HPSS call for ``Repository.reconcile``. (Jelmer Vernooij, #894455)
178
179+* Override the value returned by ``sys.getfilesystemencoding()`` for the bzr
180+ script to utf-8 when it would otherwise be ascii on a posix system. This
181+ will mean bzr works with non-ascii files when no locale or an incorrect
182+ locale is set. (Martin Packman, #794353)
183+
184 Bug Fixes
185 *********
186
187
188=== modified file 'doc/en/whats-new/whats-new-in-2.5.txt'
189--- doc/en/whats-new/whats-new-in-2.5.txt 2011-10-06 14:39:49 +0000
190+++ doc/en/whats-new/whats-new-in-2.5.txt 2011-12-14 17:09:40 +0000
191@@ -28,6 +28,13 @@
192 format is ``long``). This a work in progress and only some options are
193 supported so far.
194
195+Working on a posix system without a locale
196+******************************************
197+
198+Previously bzr needed a valid locale set to work with branches containing
199+non-ascii filenames. It will now use utf-8 rather than ascii as a fallback
200+encoding for interacting with the filesystem. This makes creating a working
201+tree and commiting to it possible for such branches in most environments.
202
203 Further information
204 *******************