Merge lp:~gz/bzr/no_locale_hacks_570495 into lp:bzr

Proposed by Martin Packman
Status: Merged
Approved by: Jelmer Vernooij
Approved revision: no longer in the source branch.
Merged at revision: 6427
Proposed branch: lp:~gz/bzr/no_locale_hacks_570495
Merge into: lp:bzr
Diff against target: 314 lines (+55/-144)
4 files modified
bzr (+2/-19)
bzrlib/osutils.py (+32/-73)
bzrlib/tests/test_osutils_encodings.py (+15/-52)
doc/en/release-notes/bzr-2.5.txt (+6/-0)
To merge this branch: bzr merge lp:~gz/bzr/no_locale_hacks_570495
Reviewer Review Type Date Requested Status
Jelmer Vernooij (community) Approve
Review via email: mp+86236@code.launchpad.net

Commit message

Simplify osutils.get_user_encoding and remove implicit setlocale calls from the function

Description of the change

Remove the need to hack around a Python bug on OSX and don't call setlocale from get_user_encoding to determine the character set. These are some quite large changes to a pretty central bzrlib function, and there may be fallout api users, but I'm keen to try and minimise that. It would be possible to extend the "treat ascii as utf-8" cludge to all posix systems, but if possible I'd like to avoid this and instead make the UI assume unicode output in fewer places, which is needed for other reasons regardless.

Not calling setlocale inside get_user_encoding is a good thing in the long run. Users of bzrlib that want localised output should be setting a valid locale at startup, rather than delaying till to an arbitrary later point and only affecting a setting that some parts of the program will use. We also need to sensibly handle output in cases where the base locale codeset is not UTF-8, which tends to get overlooked currently.

The deprecation of use_cache is mostly because this is a public function that will be around for a long time, and the implementation shouldn't be bound by an interface designed just for testability.

To post a comment you must log in.
Revision history for this message
Martin Packman (gz) wrote :

To enumerate the visible effects of these changes:

* The value returned will be normalised to the python codec name.
* Remove the fiddling with sys.platform which may help pypy a little.

OSX
===
* UTF-8 will still be used if LANG is unset, but LANG won't be touched.
* If LANG is C or another valid value that results in ascii, UTF-8 will be used,

Other posix
===========
* If setlocale is not already called (as it is in the bzr script), "ascii" will be used.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Hmm, I thought it'd already approved this.. looks like an email got lost somewhere.

Thanks for the extra explanation.

This seems reasonable in general. I still feel a bit uncertain about this, because there are possibly situations we're not taking into account that are broken by this change. Better to land this early in the 2.5 cycle than later though, so we can shake them out.

Revision history for this message
Jelmer Vernooij (jelmer) :
review: Approve
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Thanks for the extra explanation.

This seems reasonable, and the code is definitely an improvement.

That said, this is probably code that's tricky to get right everywhere, so let's land it early and pay close attention to bug reports in this area.

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzr'
2--- bzr 2012-01-03 11:38:01 +0000
3+++ bzr 2012-01-05 11:11:29 +0000
4@@ -41,26 +41,9 @@
5 profile_imports.install()
6 profiling = True
7
8-if sys.platform == 'darwin':
9- # jameinel says this hack is to force python to honor the LANG setting,
10- # even on Darwin. Otherwise it is apparently hardcoded to Mac-Roman,
11- # which is incorrect for the normal Terminal.app which wants UTF-8.
12- #
13- # "It might be that I should be setting the "system locale" somewhere else
14- # on the system, rather than setting LANG=en_US.UTF-8 in .bashrc.
15- # Switching to 'posix' and setting LANG worked for me."
16- #
17- # So we can remove this if someone works out the right way to tell Mac
18- # Python which encoding to use. -- mbp 20080703
19- sys.platform = 'posix'
20- try:
21- import locale
22- finally:
23- sys.platform = 'darwin'
24-else:
25+
26+if os.name == "posix":
27 import locale
28-
29-if os.name == "posix":
30 try:
31 locale.setlocale(locale.LC_ALL, '')
32 except locale.Error, e:
33
34=== modified file 'bzrlib/osutils.py'
35--- bzrlib/osutils.py 2012-01-03 17:11:56 +0000
36+++ bzrlib/osutils.py 2012-01-05 11:11:29 +0000
37@@ -28,6 +28,7 @@
38 lazy_import(globals(), """
39 from datetime import datetime
40 import getpass
41+import locale
42 import ntpath
43 import posixpath
44 import select
45@@ -54,8 +55,11 @@
46 """)
47
48 from bzrlib.symbol_versioning import (
49+ DEPRECATED_PARAMETER,
50 deprecated_function,
51 deprecated_in,
52+ deprecated_passed,
53+ warn as warn_deprecated,
54 )
55
56 from hashlib import (
57@@ -1976,76 +1980,53 @@
58 _cached_user_encoding = None
59
60
61-def get_user_encoding(use_cache=True):
62+def get_user_encoding(use_cache=DEPRECATED_PARAMETER):
63 """Find out what the preferred user encoding is.
64
65 This is generally the encoding that is used for command line parameters
66 and file contents. This may be different from the terminal encoding
67 or the filesystem encoding.
68
69- :param use_cache: Enable cache for detected encoding.
70- (This parameter is turned on by default,
71- and required only for selftesting)
72-
73 :return: A string defining the preferred user encoding
74 """
75 global _cached_user_encoding
76- if _cached_user_encoding is not None and use_cache:
77+ if deprecated_passed(use_cache):
78+ warn_deprecated("use_cache should only have been used for tests",
79+ DeprecationWarning, stacklevel=2)
80+ if _cached_user_encoding is not None:
81 return _cached_user_encoding
82
83- if sys.platform == 'darwin':
84- # python locale.getpreferredencoding() always return
85- # 'mac-roman' on darwin. That's a lie.
86- sys.platform = 'posix'
87- try:
88- if os.environ.get('LANG', None) is None:
89- # If LANG is not set, we end up with 'ascii', which is bad
90- # ('mac-roman' is more than ascii), so we set a default which
91- # will give us UTF-8 (which appears to work in all cases on
92- # OSX). Users are still free to override LANG of course, as
93- # long as it give us something meaningful. This work-around
94- # *may* not be needed with python 3k and/or OSX 10.5, but will
95- # work with them too -- vila 20080908
96- os.environ['LANG'] = 'en_US.UTF-8'
97- import locale
98- finally:
99- sys.platform = 'darwin'
100+ if os.name == 'posix' and getattr(locale, 'CODESET', None) is not None:
101+ # Use the existing locale settings and call nl_langinfo directly
102+ # rather than going through getpreferredencoding. This avoids
103+ # <http://bugs.python.org/issue6202> on OSX Python 2.6 and the
104+ # possibility of the setlocale call throwing an error.
105+ user_encoding = locale.nl_langinfo(locale.CODESET)
106 else:
107- import locale
108+ # GZ 2011-12-19: On windows could call GetACP directly instead.
109+ user_encoding = locale.getpreferredencoding(False)
110
111 try:
112- user_encoding = locale.getpreferredencoding()
113- except locale.Error, e:
114- sys.stderr.write('bzr: warning: %s\n'
115- ' Could not determine what text encoding to use.\n'
116- ' This error usually means your Python interpreter\n'
117- ' doesn\'t support the locale set by $LANG (%s)\n'
118- " Continuing with ascii encoding.\n"
119- % (e, os.environ.get('LANG')))
120- user_encoding = 'ascii'
121-
122- # Windows returns 'cp0' to indicate there is no code page. So we'll just
123- # treat that as ASCII, and not support printing unicode characters to the
124- # console.
125- #
126- # For python scripts run under vim, we get '', so also treat that as ASCII
127- if user_encoding in (None, 'cp0', ''):
128- user_encoding = 'ascii'
129- else:
130- # check encoding
131- try:
132- codecs.lookup(user_encoding)
133- except LookupError:
134+ user_encoding = codecs.lookup(user_encoding).name
135+ except LookupError:
136+ if user_encoding not in ("", "cp0"):
137 sys.stderr.write('bzr: warning:'
138 ' unknown encoding %s.'
139 ' Continuing with ascii encoding.\n'
140 % user_encoding
141 )
142- user_encoding = 'ascii'
143-
144- if use_cache:
145- _cached_user_encoding = user_encoding
146-
147+ user_encoding = 'ascii'
148+ else:
149+ # Get 'ascii' when setlocale has not been called or LANG=C or unset.
150+ if user_encoding == 'ascii':
151+ if sys.platform == 'darwin':
152+ # OSX is special-cased in Python to have a UTF-8 filesystem
153+ # encoding and previously had LANG set here if not present.
154+ user_encoding = 'utf-8'
155+ # GZ 2011-12-19: Maybe UTF-8 should be the default in this case
156+ # for some other posix platforms as well.
157+
158+ _cached_user_encoding = user_encoding
159 return user_encoding
160
161
162@@ -2053,28 +2034,6 @@
163 return get_terminal_encoding()
164
165
166-_message_encoding = None
167-
168-
169-def get_message_encoding():
170- """Return the encoding used for messages
171-
172- While the message encoding is a general setting it should usually only be
173- needed for decoding system error strings such as from OSError instances.
174- """
175- global _message_encoding
176- if _message_encoding is None:
177- if os.name == "posix":
178- import locale
179- # This is a process-global setting that can change, but should in
180- # general just get set once at process startup then be constant.
181- _message_encoding = locale.getlocale(locale.LC_MESSAGES)[1]
182- else:
183- # On windows want the result of GetACP() which this boils down to.
184- _message_encoding = get_user_encoding()
185- return _message_encoding or "ascii"
186-
187-
188 def get_host_name():
189 """Return the current unicode host name.
190
191
192=== modified file 'bzrlib/tests/test_osutils_encodings.py'
193--- bzrlib/tests/test_osutils_encodings.py 2011-12-02 13:36:43 +0000
194+++ bzrlib/tests/test_osutils_encodings.py 2012-01-05 11:11:29 +0000
195@@ -17,9 +17,7 @@
196 """Tests for the osutils wrapper."""
197
198 import codecs
199-import errno
200 import locale
201-import os
202 import sys
203
204 from bzrlib import (
205@@ -170,70 +168,35 @@
206
207 def setUp(self):
208 TestCase.setUp(self)
209- self.overrideAttr(locale, 'getpreferredencoding')
210+ self.overrideAttr(osutils, '_cached_user_encoding', None)
211+ self.overrideAttr(locale, 'getpreferredencoding', self.get_encoding)
212+ self.overrideAttr(locale, 'CODESET', None)
213 self.overrideAttr(sys, 'stderr', StringIOWrapper())
214
215+ def get_encoding(self, do_setlocale=True):
216+ return self._encoding
217+
218 def test_get_user_encoding(self):
219- def f():
220- return 'user_encoding'
221-
222- locale.getpreferredencoding = f
223+ self._encoding = 'user_encoding'
224 fake_codec.add('user_encoding')
225- self.assertEquals('user_encoding',
226- osutils.get_user_encoding(use_cache=False))
227+ self.assertEquals('iso8859-1', # fake_codec maps to latin-1
228+ osutils.get_user_encoding())
229 self.assertEquals('', sys.stderr.getvalue())
230
231 def test_user_cp0(self):
232- def f():
233- return 'cp0'
234-
235- locale.getpreferredencoding = f
236- self.assertEquals('ascii', osutils.get_user_encoding(use_cache=False))
237+ self._encoding = 'cp0'
238+ self.assertEquals('ascii', osutils.get_user_encoding())
239 self.assertEquals('', sys.stderr.getvalue())
240
241 def test_user_cp_unknown(self):
242- def f():
243- return 'cp-unknown'
244-
245- locale.getpreferredencoding = f
246- self.assertEquals('ascii', osutils.get_user_encoding(use_cache=False))
247+ self._encoding = 'cp-unknown'
248+ self.assertEquals('ascii', osutils.get_user_encoding())
249 self.assertEquals('bzr: warning: unknown encoding cp-unknown.'
250 ' Continuing with ascii encoding.\n',
251 sys.stderr.getvalue())
252
253 def test_user_empty(self):
254 """Running bzr from a vim script gives '' for a preferred locale"""
255- def f():
256- return ''
257-
258- locale.getpreferredencoding = f
259- self.assertEquals('ascii', osutils.get_user_encoding(use_cache=False))
260+ self._encoding = ''
261+ self.assertEquals('ascii', osutils.get_user_encoding())
262 self.assertEquals('', sys.stderr.getvalue())
263-
264- def test_user_locale_error(self):
265- def f():
266- raise locale.Error, 'unsupported locale'
267-
268- locale.getpreferredencoding = f
269- self.overrideEnv('LANG', 'BOGUS')
270- self.assertEquals('ascii', osutils.get_user_encoding(use_cache=False))
271- self.assertEquals('bzr: warning: unsupported locale\n'
272- ' Could not determine what text encoding to use.\n'
273- ' This error usually means your Python interpreter\n'
274- ' doesn\'t support the locale set by $LANG (BOGUS)\n'
275- ' Continuing with ascii encoding.\n',
276- sys.stderr.getvalue())
277-
278-
279-class TestMessageEncoding(TestCase):
280- """Tests for getting the encoding used by system messages"""
281-
282- def test_get_message_encoding(self):
283- encoding_name = osutils.get_message_encoding()
284- "".decode(encoding_name) # should be a valid encoding name
285-
286- def test_get_message_encoding_decodes_strerror(self):
287- encoding_name = osutils.get_message_encoding()
288- for number, name in errno.errorcode.iteritems():
289- string = os.strerror(number)
290- string.decode(encoding_name)
291
292=== modified file 'doc/en/release-notes/bzr-2.5.txt'
293--- doc/en/release-notes/bzr-2.5.txt 2012-01-05 11:06:47 +0000
294+++ doc/en/release-notes/bzr-2.5.txt 2012-01-05 11:11:29 +0000
295@@ -107,6 +107,9 @@
296 a dictionary which is not a supported use case for the configuration
297 stacks). (Vincent Ladeuil, #908050)
298
299+* Stop altering ``sys.platform`` on OSX when initialising the locale.
300+ (Martin Packman, #570495)
301+
302 Documentation
303 *************
304
305@@ -126,6 +129,9 @@
306 * ``Repository.get_commit_builder`` now takes a ``config_stack``
307 rather than a ``config`` argument. (Jelmer Vernooij)
308
309+* Scripts using bzrlib should now ensure setlocale is called on posix
310+ platforms if they need a non-ascii user encoding. (Martin Packman)
311+
312 * Send formats now accept a new optional argument ``submit_branch``,
313 which can be None or a Branch object for the submit branch location.
314 (Jelmer Vernooij)