Bug #666565 “"utf8" charmap in locale name is wrong” : Bugs : language-selector package : Ubuntu

Revision history for this message

Lauri Tirkkonen (lotheac) wrote on 2010-10-26:

#1

Replace '.utf8' with 'UTF-8' in generated locale strings. (1.8 KiB, text/plain)

Revision history for this message

Aron Xu (happyaron) wrote on 2011-01-25:

#2

The "utf8" versus "UTF-8" problem has been around for some time, and there have been arguments about it. Let's see:

$ locale -a
C
POSIX
zh_CN.utf8
zh_SG.utf8

# locale-gen
Generating locales...
zh_CN.UTF-8... up-to-date
zh_SG.UTF-8... up-to-date
Generation complete.

The problem is that we have mixed up the use of "utf8" and "UTF-8", and I think language-selector isn't the root of the problem; it is a lower-level one.

Many Ubuntu developers tend to use utf8 instead of UTF-8 because it is "defined in eglibc and UTF-8 is now an alias". But we may consider it a bug in the eglibc package in Ubuntu. It has broken many things, confused many people, and made work more complicated. I don't mean that changing it to UTF-8 would solve every problem, but it might be the right thing to do.

Changed in ubuntu-translations:
importance:	Undecided → High
status:	New → Triaged

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-01-25: Re: [Bug 666565] Re: "utf8" charmap in locale name is wrong

#3

But "utf8" has been the canonical form in eglibc for as long as I can
remember (at least ten years or so I believe). This isn't something
specific to Ubuntu. Changing it seems risky.

Revision history for this message

Aron Xu (happyaron) wrote on 2011-01-25:

#4

Yes, changing it is risky. An alternative is to "fix" this in langpack-locales and try to make everything in the system use "utf8" if any problems occur. A temporary solution that accepts both utf8 and UTF-8 is of course needed, but it should be just a workaround. This problem costs more and more time whenever users need to change locale settings, and the complexity of dealing with related problems is now much higher than ever before.

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-01-25:

#5

I still believe that the best option is to use UTF-8 as the primary
user-visible name in environment variables and such (since it's what's
in /usr/share/i18n/SUPPORTED), even though it's an alias, but to fix the
small handful of things that have trouble when you use one of the other
valid spellings. It's not that hard - I've implemented software that
does it. The vast majority of locale-aware software doesn't need to
care, because it will just do setlocale(LC_ALL, "") and get on with
things regardless of whether an alias is in use. It's only a tiny
minority of software that does more sophisticated things with locale
strings that needs to care.

The reason to take this approach is that software that parses locale
strings in ways that only handle particular spellings of them tend to be
buggy in other ways. For example, such buggy software can easily fail
to handle LANG=en_IN as a UTF-8 locale, even though it's defined as such
in /usr/share/i18n/SUPPORTED (the .UTF-8 suffix is mainly for dealing
with locales that previously had a non-UTF-8 version, and some newer
locales just went UTF-8 from the start). This sort of thing is easily
fixed by (for example) using nl_langinfo(CODESET) rather than trying to
parse locale strings.
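
The point about nl_langinfo(CODESET) can be illustrated with a minimal Python sketch. It assumes a Unix system where Python's locale module exposes nl_langinfo; the helper names current_codeset and is_utf8 are hypothetical and not taken from any package discussed here.

```python
import codecs
import locale


def current_codeset():
    """Return the character encoding of the active locale, as reported by libc."""
    try:
        # Adopt whatever LANG / LC_* request; the locale string is never parsed.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The requested locale is not installed; keep the locale already active.
        pass
    return locale.nl_langinfo(locale.CODESET)


def is_utf8(codeset):
    """Compare a codeset name via Python's codec registry, ignoring spelling."""
    try:
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        return False


if __name__ == "__main__":
    codeset = current_codeset()
    print(codeset, is_utf8(codeset))
```

With this approach LANG=en_IN, LANG=en_IN.utf8 and LANG=en_IN.UTF-8 should all be treated the same way, because the answer comes from the C library rather than from the spelling of the variable.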

Fundamentally, locale strings are supposed to be opaque, and anything
that parses them had better (a) have a good excuse and (b) read the
documentation very carefully to understand what it can and can't do.

Getting back to the original patch, the general idea seems OK to me, but
I think it would be helpful for it to take a slightly different approach
to implementation. Rather than just appending .UTF-8, I suggest
searching /usr/share/i18n/SUPPORTED for a suitable match for the
language, country, and variant which has "UTF-8" as the second column.
That way, language-selector will always select the canonical
user-visible name for the locale, even if it's one of the interesting
cases such as en_IN where the canonical name doesn't have an encoding
suffix.
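
The lookup suggested above might be sketched as follows in Python. This is only an illustration under assumptions: it takes the file's lines as input, assumes each non-comment line has the form "NAME CHARMAP", and uses a hypothetical function name; it is not the code that language-selector ended up using.

```python
def canonical_utf8_locale(supported_lines, language, country=None, variant=None):
    """Return the first SUPPORTED entry for the given language/country/variant
    whose second column is UTF-8, or None if there is no such entry."""
    wanted = language if country is None else language + "_" + country
    for line in supported_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 2 or fields[1] != "UTF-8":
            continue  # only UTF-8 charmaps are of interest
        name = fields[0]
        base, _, modifier = name.partition("@")
        base = base.split(".", 1)[0]  # drop an encoding suffix such as .UTF-8
        if base == wanted and (modifier or None) == variant:
            return name
    return None


if __name__ == "__main__":
    # Illustrative sample in the assumed format, not a copy of the real file.
    sample = [
        "# sample entries",
        "en_IN UTF-8",
        "en_US.UTF-8 UTF-8",
        "en_US ISO-8859-1",
        "sr_RS@latin UTF-8",
    ]
    print(canonical_utf8_locale(sample, "en", "IN"))
    print(canonical_utf8_locale(sample, "en", "US"))
```

On a real system the lines would come from open("/usr/share/i18n/SUPPORTED"). Returning the name exactly as listed is what yields en_IN without a suffix for the special cases.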

Revision history for this message

Lauri Tirkkonen (lotheac) wrote on 2011-01-25:

#6

Colin's right, of course -- my main issue with this isn't software running locally, but remote systems. That's not trivial though: ssh into some legacy machine, and they might not have compiled your locale at all, or perhaps it's different (such as the case with en_IN). Of course, that's not an Ubuntu bug, but rather a problem with POSIX not separating charmaps from locales.

Brian Murray (brian-murray) on 2011-01-25

tags:

added: patch

Revision history for this message

kk19881201 (kk19881201) wrote on 2011-01-28:

#7

When I was using GVIM in China, it could not display Chinese characters properly. It only recognizes UTF-8, not utf8.

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-01-28:

#8

Then that's a vim bug - I've opened a task for it.

Revision history for this message

Aron Xu (happyaron) wrote on 2011-01-28:

#9

No, it wouldn't be. I think an application that doesn't work with .UTF-8 has a bug, but one that merely doesn't work with .utf8 does not.

Revision history for this message

Aron Xu (happyaron) wrote on 2011-01-28:

#10

Consider the gettext documentation, which may not really be a standard but shows its authors' attitude toward locale names: it gives .UTF-8 as an example and does not mention .utf8 at all.
http://www.gnu.org/software/hello/manual/gettext/Locale-Names.html#Locale-Names

I didn't do detailed research, but some other major distributions use .UTF-8 in their official documentation, e.g. http://www.gentoo.org/doc/en/utf-8.xml

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-01-28:

#11

On Fri, Jan 28, 2011 at 12:13:56PM -0000, Aron Xu wrote:
> No, it wouldn't be. I think any application that doesn't work with
> .UTF-8 should be a bug, but not for it doesn't work with .utf8.

I entirely disagree. .utf8 is a valid spelling of the locale and it's a
bug for applications to fail to work with it.

Revision history for this message

ZhengPeng Hou (zhengpeng-hou) wrote on 2011-01-28:

#12

If we shift to utf8, what about other distros that still use UTF-8? Are
we going to ignore interoperability? In addition, how can we
convince all other user-space applications to adopt utf8?
I found that UTF-8 is still being used in eglibc, so what's the
advantage of using utf8?

On Fri, Jan 28, 2011 at 8:36 PM, Colin Watson <email address hidden> wrote:
> On Fri, Jan 28, 2011 at 12:13:56PM -0000, Aron Xu wrote:
>> No, it wouldn't be. I think any application that doesn't work with
>> .UTF-8 should be a bug, but not for it doesn't work with .utf8.
>
> I entirely disagree. .utf8 is a valid spelling of the locale and it's a
> bug for applications to fail to work with it.
>
> https://bugs.launchpad.net/bugs/666565
>
> Title:
> "utf8" charmap in locale name is wrong
>
> Status in Ubuntu Translations:
> Triaged
> Status in “eglibc” package in Ubuntu:
> New
> Status in “langpack-locales” package in Ubuntu:
> New
> Status in “language-selector” package in Ubuntu:
> New
> Status in “vim” package in Ubuntu:
> New
>
> Bug description:
> Binary package hint: language-selector
>
> LanguageSelector/macros.py explicitly sets the charmap part of locale
> strings to "utf8" - it should be set to "UTF-8" instead. This is
> relevant because not all systems alias locale names with the former to
> the latter, and compatibility with those systems is broken.
>
> Rationale for this change is that the 'locales' package uses the uppercase hyphenated format everywhere, even going as far as replacing '.utf8' with it in one case:
> % dpkg -L locales | xargs grep '\.utf8'
> /usr/sbin/locale-gen: elif [ $IS_LANG = no ] && L=`grep "^${1/%.utf8/.UTF-8} " /usr/share/i18n/SUPPORTED`; then

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-01-28:

#13

I am not saying that we should shift to use .utf8. I am saying that
when the locale ends up as .utf8 for one reason or another, applications
must not break.

This does not have to be an either/or thing! The primary names for the
locales are still generally .UTF-8 and should remain that way. But
locale aliases exist and it's only a small number of buggy applications
that fail to cope with them.
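
For the small minority of software that really has to compare locale names, one way to cope with aliases is to normalize the codeset part before comparing. The Python sketch below is an assumption-laden illustration: it lowercases the codeset and drops non-alphanumeric characters, which is a simplified stand-in for the kind of normalization a C library may apply, and the names normalize_locale and same_locale are hypothetical.

```python
import re

# language[_territory][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<base>[^.@]+)(?:\.(?P<codeset>[^@]+))?(?:@(?P<modifier>.+))?$"
)


def normalize_locale(name):
    """Return a comparison key in which 'UTF-8', 'utf8' and 'Utf_8' coincide."""
    match = _LOCALE_RE.match(name)
    if match is None:
        return name  # not in the expected shape; leave it untouched
    key = match.group("base")
    codeset = match.group("codeset")
    if codeset:
        # Keep only letters and digits, lowercased.
        key += "." + "".join(ch for ch in codeset if ch.isalnum()).lower()
    if match.group("modifier"):
        key += "@" + match.group("modifier")
    return key


def same_locale(a, b):
    """True if two locale names differ only in the spelling of the codeset."""
    return normalize_locale(a) == normalize_locale(b)


if __name__ == "__main__":
    print(same_locale("zh_CN.utf8", "zh_CN.UTF-8"))
```

The key is meant only for comparison. It does not capture cases such as en_IN, where a name without a suffix is also a UTF-8 locale; that is what nl_langinfo(CODESET) is for.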

Revision history for this message

Aron Xu (happyaron) wrote on 2011-01-28:

#14

If we fix it for users and applications so that, say, the only spelling visible to them is .UTF-8, then we avoid many issues. It also improves compatibility when people connect to other distros (for example via ssh).

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-01-28:

#15

I agree that it makes sense to present UTF-8 as the primary spelling.
Note that I already said above that I agreed that language-selector
should be fixed. But it is clearly wrong to deny the existence of
locale aliases.

Revision history for this message

Gunnar Hjalmarsson (gunnarhj) wrote on 2011-02-03:

#16

As of version 2.32.0-0ubuntu2, gdm (Ubuntu) may assign a locale name to LC_MESSAGES.

Changed in gdm (Ubuntu):
assignee:	nobody → Gunnar Hjalmarsson (gunnarhj)

Revision history for this message

Gunnar Hjalmarsson (gunnarhj) wrote on 2011-02-03:

#17

There seems to be a consensus that the encoding part of locale names
assigned to the LANG or LC_* environment variables should be .UTF-8
rather than .utf8. I'm currently working on language-selector and GDM
for other language/locale related matters, so I can include the
necessary changes in a couple of merge proposals in the pipeline. Before
I do so, and since I have no idea of my own to what extent the changes
would create new issues, I'd like someone to triage the bug with respect
to language-selector and gdm (Ubuntu). I'd also need help drawing a
conclusion from the reasoning below.

On 2011-01-25 12:54, Colin Watson wrote:
> ... software that parses locale strings in ways that only handle
> particular spellings of them tend to be buggy in other ways. For
> example, such buggy software can easily fail to handle LANG=en_IN as
> a UTF-8 locale, even though it's defined as such in
> /usr/share/i18n/SUPPORTED (the .UTF-8 suffix is mainly for dealing
> with locales that previously had a non-UTF-8 version, and some newer
> locales just went UTF-8 from the start).

Doesn't that point towards simply appending .UTF-8 to e.g. en_IN,
irrespective of the name according to /usr/share/i18n/SUPPORTED?

I did this test:

  [gunnar@gunnar-laptop ~/sandbox]$ sh
  $ cat mytest.po
  msgid "hello"
  msgstr "hello from India"
  $ dir=/usr/share/locale/en_IN/LC_MESSAGES
  $ sudo mkdir -p $dir
  $ sudo msgfmt mytest.po -o $dir/mytest.mo
  $ LANGUAGE=''
  $ LC_MESSAGES=en_IN
  $ echo $( gettext -d mytest hello )
  hello from India
  $ LC_MESSAGES=en_IN.utf8
  $ echo $( gettext -d mytest hello )
  hello from India
  $ LC_MESSAGES=en_IN.UTF-8
  $ echo $( gettext -d mytest hello )
  hello from India
  $ exit
  [gunnar@gunnar-laptop ~/sandbox]$

No complaints, and gettext found the Indian 'translation' in all three
cases, so en_IN.UTF-8 seems to work. Or would that name cause other apps
to fail?

> Getting back to the original patch, the general idea seems OK to me,
> but I think it would be helpful for it to take a slightly different
> approach to implementation. Rather than just appending .UTF-8, I
> suggest searching /usr/share/i18n/SUPPORTED for a suitable match for
> the language, country, and variant which has "UTF-8" as the second
> column. That way, language-selector will always select the canonical
> user-visible name for the locale, even if it's one of the
> interesting cases such as en_IN where the canonical name doesn't have
> an encoding suffix.

Even if we go for the canonical names, I don't think it's
necessary to parse /usr/share/i18n/SUPPORTED.

  [gunnar@gunnar-laptop ~]$ locale -a | grep -F en_IN
  en_IN
  en_IN.utf8
  [gunnar@gunnar-laptop ~]$

As you can see, the special case en_IN is represented by two items in
the 'locale -a' output. We ought to be able to make use of that info.

This example shows how the English locale names might be grabbed:

  [gunnar@gunnar-laptop ~]$ sh
  $ tmp=$( locale -a | grep -xvE C\|POSIX )
  $ no_enc=$( echo "$tmp" | grep -vF .utf8 )
  $ for locale in $( echo "$tmp" | grep -F .utf8 | sed 's/\.utf8//' )
  > do
  >     if ! expr $locale : en > /dev/null ; then
  >         continue
  >     elif expr "$no_enc" : .*$locale > /dev/null ; then
  >         echo $locale
  >     else
  >         echo $( echo $locale | sed -r 's/([^@]+)/\1.UTF-8/' )
  >     fi
  > done
  en_AG
  en_AU.UTF-8
  en_BW.UTF-8
  en_CA.UTF-8
  en_DK.UTF-8
  en_GB.UTF-8
  en_HK.UTF-8
  en_IE.UTF-8
  en_IN
  en_NG
  en_NZ.UTF-8
  en_PH.UTF-8
  en_SG.UTF-8
  en_US.UTF-8
  en_ZA.UTF-8
  en_ZW.UTF-8
  $ exit
  [gunnar@gunnar-laptop ~]$

As you can see, English locale names for Antigua/Barbuda and Nigeria are
the same kind of special cases as en_IN.
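
For comparison, the same inference can be sketched in Python. This is a loose port of the shell session above, not a tested replacement: it takes the 'locale -a' lines as input, uses exact membership instead of the expr pattern match, and the function name is hypothetical.

```python
def names_from_locale_a(lines, language_prefix="en"):
    """Derive preferred locale names from 'locale -a' style output."""
    entries = [l.strip() for l in lines if l.strip() not in ("", "C", "POSIX")]
    # Names listed without an encoding, e.g. en_IN.
    bare = {e for e in entries if ".utf8" not in e}
    result = []
    for entry in entries:
        if ".utf8" not in entry:
            continue
        stripped = entry.replace(".utf8", "", 1)
        if not stripped.startswith(language_prefix):
            continue
        if stripped in bare:
            # Also listed without a suffix: use that spelling as it stands.
            result.append(stripped)
        else:
            # Insert .UTF-8 before any @modifier.
            base, sep, modifier = stripped.partition("@")
            result.append(base + ".UTF-8" + sep + modifier)
    return result


if __name__ == "__main__":
    sample = ["C", "POSIX", "en_IN", "en_IN.utf8", "en_US.utf8", "zh_CN.utf8"]
    print(names_from_locale_a(sample))
```

Like the shell version, this treats a name that appears both with and without .utf8 as a special case whose preferred spelling has no suffix. That is an inference from the output format rather than something the tool guarantees.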

Changed in language-selector (Ubuntu):
assignee:	nobody → Gunnar Hjalmarsson (gunnarhj)

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-02-11:

#18

I prefer the canonical names myself (i.e. en_IN rather than
en_IN.UTF-8), but either should be OK.

Parsing /usr/share/i18n/SUPPORTED is *easier* than parsing the
output of 'locale -a', and I think it's safer than trying to draw
inferences from details of the latter's output.

Revision history for this message

Gunnar Hjalmarsson (gunnarhj) wrote on 2011-02-11:

#19

Copied from #ubuntu-devel, for the record:

Gunnar Hjalmarsson:
Thanks! Then, how about just replacing .utf8 with .UTF-8 to start with, and introduce parsing of .../SUPPORTED later on, if the simplistic solution proves to not suffice?

Colin Watson:
it would likely be an improvement, at least

Gunnar Hjalmarsson:
OK, then I'll include .utf8 => .UTF-8 in a couple of MPs, so we can confirm that it's an improvement, to start with.
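
The simple replacement agreed on here could look roughly like the Python sketch below. It is not the actual change made to LanguageSelector/macros.py or to the gdm patch; the function name is hypothetical, and the handling of an @modifier is an assumption.

```python
def prefer_utf8_spelling(locale_name):
    """Rewrite a trailing '.utf8' codeset as '.UTF-8', keeping any @modifier."""
    base, sep, modifier = locale_name.partition("@")
    suffix = ".utf8"
    if base.endswith(suffix):
        base = base[: -len(suffix)] + ".UTF-8"
    return base + sep + modifier


if __name__ == "__main__":
    for name in ("zh_CN.utf8", "en_IN", "ca_ES.utf8@valencia"):
        print(name, "->", prefer_utf8_spelling(name))
```

Names without an encoding suffix, such as en_IN, pass through unchanged. The special cases therefore stay as they are until a lookup in SUPPORTED is introduced.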

Changed in language-selector (Ubuntu):
status:	New → In Progress
Changed in gdm (Ubuntu):
status:	New → In Progress

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-02-14:

#20

This bug was fixed in the package gdm - 2.32.0-0ubuntu8

---------------
gdm (2.32.0-0ubuntu8) natty; urgency=low

  [ Gunnar Hjalmarsson ]
  * debian/patches/36_language_environment_settings.patch:
    - Use locale names with '.UTF-8' instead of '.utf8' when setting
      the LC_MESSAGES environment variable (LP: #666565).
  * debian/patches/40_one_lang_option_per_translation.patch:
    - Modification of /usr/share/gdm/language-options so an absent
      translation directory won't cause it to exit.
-- Evan Dandrea <email address hidden> Mon, 14 Feb 2011 15:53:38 +0000

Changed in gdm (Ubuntu):
status:	In Progress → Fix Released

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-02-14:

#21

This bug was fixed in the package language-selector - 0.13

---------------
language-selector (0.13) natty; urgency=low

  [ Gunnar Hjalmarsson ]
  * LanguageSelector/gtk/GtkLanguageSelector.py:
    - Ensure that main or origin country is included when country
      specific options for a language are shown (LP: #710148).
    - Do not let an absent translation directory make the program crash
      (LP: #714093).
  * data/LanguageSelector.ui:
    - Shorter label to describe the second tab (LP: #709855).
  * LanguageSelector/macros.py:
    - Use locale names with '.UTF-8' instead of '.utf8' when setting
      LC_* or LANG environment variables (LP: #666565, #700619).
      Thanks to Lauri Tirkkonen for the patch!
-- Evan Dandrea <email address hidden> Mon, 14 Feb 2011 16:13:04 +0000

Changed in language-selector (Ubuntu):
status:	In Progress → Fix Released

Revision history for this message

Gunnar Hjalmarsson (gunnarhj) wrote on 2011-03-07:

#22

Fixes of this bug for Lucid and Maverick are now available in official backports packages. To make Synaptic check for backports updates you can do:

o System -> Administration -> Update Manager -> Settings...

o Select the "Updates" tab and check the "Unsupported updates" option.

More about Ubuntu backports:
https://help.ubuntu.com/community/UbuntuBackports

Revision history for this message

Martin Pitt (pitti) wrote on 2011-11-16:

#23

Not going to apply a large Ubuntu-specific patch for this in langpack-locales. This should get fixed in upstream glibc or not at all IMHO.

Changed in langpack-locales (Ubuntu):
status:	New → Won't Fix

Revision history for this message

David Planella (dpm) wrote on 2012-10-18:

#24

From the latest comments, I'm unsure about the status. Is there anything else needed to fix this bug?

Changed in ubuntu-translations:
status:	Triaged → Incomplete

David Planella (dpm) on 2012-10-18

Changed in ubuntu-translations:
importance:	High → Low

Revision history for this message

Lauri Tirkkonen (lotheac) wrote on 2012-10-18:

#25

This was fixed in language-selector, which is what I originally reported it against. I'm not sure why it's marked as affecting ubuntu-translations.

Revision history for this message

David Planella (dpm) wrote on 2012-10-18:

#26

We track all i18n and l10n bugs under the ubuntu-translations project to have better oversight of them. Thanks a lot for the feedback; I've marked it as Fix Released there.

Changed in ubuntu-translations:
status:	Incomplete → Fix Released

Revision history for this message

Launchpad Janitor (janitor) wrote on 2014-06-07:

#27

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in eglibc (Ubuntu):
status:	New → Confirmed
Changed in vim (Ubuntu):
status:	New → Confirmed

Affects	Status	Importance	Assigned to
Ubuntu Translations	Fix Released	Low	Unassigned
eglibc (Ubuntu)	Confirmed	Undecided	Unassigned
gdm (Ubuntu)	Fix Released	Undecided	Gunnar Hjalmarsson
langpack-locales (Ubuntu)	Won't Fix	Undecided	Unassigned
language-selector (Ubuntu)	Fix Released	Undecided	Gunnar Hjalmarsson
vim (Ubuntu)	Confirmed	Undecided	Unassigned