Ubuntu
unzip package

Default charsets handling for Windows archives in CJKV+th locale

Bug #1422290 reported by Aron Xu on 2015-02-16

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	unzip (Debian)	Fix Released	Unknown	debbugs #483290
	unzip (Ubuntu)	Triaged	Medium	Unassigned

Bug Description

With the current unzip package in Ubuntu, the charset must be specified explicitly to extract zip files created on localized Windows systems.

For example, for zip files sent from a Japanese-localized Windows system:
$ zipinfo -O CP932 sent-from-localized-windows.zip
$ unzip -O CP932 sent-from-localized-windows.zip

This method does not work for GUI applications such as file-roller, because users have no way to specify the charset from the GUI.

The attached branch adds default charset handling for Windows archives in CJKV+th locales, inspired by the Ubuntu Kylin approach.

As a result of bug #580961, two options were added as an Ubuntu patch:
> -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
> -I CHARSET specify a character encoding for UNIX and other archives

Ubuntu Kylin then added a default encoding through environment variables for their distribution:
http://bazaar.launchpad.net/~ubuntukylin-members/ubuntukylin-default-settings/trunk/revision/171

In Ubuntu we can go further and do this in a better way:
- per-user settings derived from each user's locale instead of global settings
- a locale guard, so that other locales are not affected

I only add "-O", so there is no behavior change for zip files created on Ubuntu or other Linux/UNIX systems. This branch only makes zip files created on localized Windows systems extract seamlessly.
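
A minimal sketch of the per-locale default described above, assuming a POSIX shell snippet that is sourced at login. It is not the script from the attached branch: the function names `fallback_charset` and `effective_ctype` are hypothetical, the territory patterns are an assumption, and leaving an already exported UNZIP or ZIPINFO untouched is a design choice of this sketch. The code pages are the ones listed in this description.

```shell
# Hypothetical helper: map a locale name to a legacy Windows code page.
# Prints nothing for locales outside the CJKV+th set (the "locale guard").
fallback_charset() {
    case "$1" in
        ja_JP*) echo CP932 ;;   # Japanese
        ko_KR*) echo CP949 ;;   # Korean
        zh_CN*) echo CP936 ;;   # Chinese, simplified
        zh_TW*) echo CP950 ;;   # Chinese, traditional
        th_TH*) echo CP874 ;;   # Thai
        vi_VN*) echo CP1258 ;;  # Vietnamese
    esac
}

# Effective character-type locale: LC_ALL wins over LC_CTYPE, which wins over LANG.
effective_ctype() {
    echo "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

charset=$(fallback_charset "$(effective_ctype)")
if [ -n "$charset" ]; then
    # Only fill in defaults; values the user exported already are kept.
    : "${UNZIP:=-O $charset}"
    : "${ZIPINFO:=-O $charset}"
    export UNZIP ZIPINFO
fi
```

With a ja_JP.UTF-8 login this would leave UNZIP and ZIPINFO set to "-O CP932", while an en_US.UTF-8 login would set nothing.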

The charset list is taken from:
https://msdn.microsoft.com/en-us/goglobal/bb964654
and
msdos/msdos.c in unzip package:
   case 932: /* Japanese */
   case 949: /* Korean */
   case 936: /* Chinese, simple */
   case 950: /* Chinese, traditional */
   case 874: /* Thai */
   case 1258: /* Vietnamese */

(Copied from @nobuto's branch description.)

Related branches

lp:~nobuto/ubuntu/vivid/unzip/fallback-encoding

Superseded for merging into lp:ubuntu/vivid/unzip

Mathieu Trudel-Lapierre: Needs Information on 2015-05-11

Sebastien Bacher: Needs Information on 2015-02-16

Aron Xu (community): Approve on 2015-02-15

lp:~nobuto/ubuntu/wily/unzip/fallback-encoding

On hold for merging into lp:ubuntu/wily/unzip

Steve Langasek: Needs Fixing on 2015-09-04

Aron Xu: Pending requested 2015-08-23

Aron Xu (happyaron) on 2015-02-16

Changed in unzip (Ubuntu):
importance:	Undecided → Medium
status:	New → Triaged

Aron Xu (happyaron) wrote on 2015-02-16:

Additional background:

On Windows, file names are encoded differently for each of the CJKV+th locales, and a ZIP archive does not store which encoding its file names use. When the archive is extracted on a system that uses another encoding (e.g. UTF-8 on Linux), the file names come out as garbage and the unzip command replaces the undecodable characters with ???. In practice no algorithm can detect the encoding reliably, and file names are usually too short to give a detector enough input (unlike web pages in a browser), which makes detection even less reliable.
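
To make the ambiguity concrete, here is a small sketch that decodes the same four bytes under two code pages with `iconv`. The helper name `decode_as` is hypothetical, and CP866 is picked only as an example of a wrong guess; the bytes are the CP932 encoding of the first two characters of a Japanese file name.

```shell
# Decode raw bytes under a chosen source encoding and print them as UTF-8.
# $1: source encoding, $2: bytes written as printf octal escapes
decode_as() {
    printf "$2" | iconv -f "$1" -t UTF-8
}

name_bytes='\202\271\202\351'   # 0x82 0xB9 0x82 0xE9

decode_as CP932 "$name_bytes"; echo   # the reading the archive creator intended
decode_as CP866 "$name_bytes"; echo   # the same bytes read as a Cyrillic DOS code page
```

Both calls succeed without error, which is the core of the problem: nothing in the bytes themselves says which reading is correct.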

The upstream solution to this problem is documented in bug #580961, but it is not something ordinary users can apply directly. Hence we are adding the -O switch, which specifies the encoding for archives created on Windows, as a locale hack in the distribution.

Nobuto Murata (nobuto) on 2015-02-16

description:	updated

Bug Watch Updater (bug-watch-updater) on 2015-02-16

Changed in unzip (Debian):
status:	Unknown → Confirmed

Yuan Chao (yuanchao) wrote on 2015-02-26:

It would be nice to have some auto-detect mechanism on top of this locale fallback. In my case, most zip files that need an explicit encoding use one that does not correspond to my locale setting.

Nobuto Murata (nobuto) wrote on 2015-02-27:

@Yuan,

My patch checks LC_CTYPE first, so you can, for example, set LC_CTYPE and LC_MESSAGES to different locales. And of course you can manually export the UNZIP and ZIPINFO variables in your ~/.profile. I understand that my patch is only a short-term workaround.

FWIW, unar supports encoding auto-detection, but unzip does not. You can see the auto-detection result with:
$ sudo apt-get install unar
$ lsar -l -pe /PATH/TO/ZIPFILE
I'm not sure whether file-roller supports an unar backend.

Yuan Chao (yuanchao) wrote on 2015-02-28:

Dear @Nobuto,

I appreciate the patch work very much, but it simply doesn't fit my use case. Quite frequently, I get zip files with CJK file names from zh_CN and ja_JP. (My environment is either zh_TW or en_US, the latter for my office desktop PC.) Changing LC_CTYPE to something other than UTF-8 is definitely *no good* here; it would generate more "mojibake" file names. Changing LC_MESSAGES is not necessary since I can read either language. Support in the GUI front-end is what is really needed here.

My use case may not be close to that of general end users. But I know many experienced users adopt the en_US locale to be able to use type-and-search for launching applications. (Surely this is another issue.)

Another thing I ran into before is using CJKV characters in the archive password. Not sure if this is (still) a problem? (unrar and unzip)

Aron Xu (happyaron) wrote on 2015-02-28:

@yuanchao, with or without this trick, running unzip would give you garbled file names, so I don't think this change would bother you as much as you describe, would it?

Yuan Chao (yuanchao) wrote on 2015-02-28:

Well, without this trick, the file names can be recovered with 'convmv'. But with this trick, they would be scrambled further... Still, I personally prefer auto-detection plus this fallback, or an option in a GUI such as file-roller.
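
The kind of recovery meant here can be sketched with plain `iconv` rather than convmv: re-encode the garbled text back into the bytes it came from, then decode those bytes with the right code page. The helper name `fix_name` is hypothetical, and the CP866/CP932 pair is an assumption chosen to illustrate a Japanese name that was read as Cyrillic.

```shell
# Undo a wrong decoding of a file name.
# $1: garbled name (UTF-8), $2: encoding that was wrongly assumed, $3: actual encoding
fix_name() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" | iconv -f "$3" -t UTF-8
}

fix_name 'В╣ВщВчВдВ╟.ust' CP866 CP932; echo
```

This round trip only works while the wrong decoding was lossless. Once any byte has been replaced by a placeholder character, the original bytes are gone and the name cannot be rebuilt this way.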

Aron Xu (happyaron) wrote on 2015-03-01:

@yuanchao, you cannot recover file names after extracting with unzip (because characters are replaced by question marks), but you can when using 7zip.

Yuan Chao (yuanchao) wrote on 2015-03-01:

This is from one of my machines running Lubuntu:

$ export |grep LANG
declare -x LANG="en_US.UTF-8"

$ export |grep LC
declare -x LC_ADDRESS="en_US.UTF-8"
declare -x LC_IDENTIFICATION="en_US.UTF-8"
declare -x LC_MEASUREMENT="en_US.UTF-8"
declare -x LC_MONETARY="en_US.UTF-8"
declare -x LC_NAME="en_US.UTF-8"
declare -x LC_NUMERIC="en_US.UTF-8"
declare -x LC_PAPER="en_US.UTF-8"
declare -x LC_TELEPHONE="en_US.UTF-8"
declare -x LC_TIME="en_US.UTF-8"

$ unzip -h
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
...

Use the file from here: http://www1.axfc.net/uploader/Sc/so/325701.zip (passwd: backer) (CP932)

$ unzip celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/В╣ВщВчВдВ╟.ust
  inflating: celluloid/В╣ВщВчВдВ╟2Ф╘.ust
  inflating: celluloid/В╣ВщВчВдВ╟СхГTГrСOВйВч.ust

$ unzip -O cp932 celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/せるらうど.ust
  inflating: celluloid/せるらうど2番.ust
  inflating: celluloid/せるらうど大サビ前から.ust

$ unzip -O cp936 celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/偣傞傜偆偳.ust
  inflating: celluloid/偣傞傜偆偳2斣.ust
  inflating: celluloid/偣傞傜偆偳戝僒價慜偐傜.ust

$ unzip -O cp950 celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/��炤��.ust
  inflating: celluloid/��炤��2��.ust
  inflating: celluloid/��炤�Ǒ��T�r�O��.ust

Another file from here http://3jf.wodemo.com/file/310894 (CP936)

$ unzip -L 王妃.zip
Archive: 王妃.zip
inflating: ═їх·_a.ust
inflating: ═їх·_b.ust

$ unzip -O cp932 王妃.zip
Archive: 王妃.zip
inflating: ﾍ銈A.ust
inflating: ﾍ銈B.ust

$ unzip -O cp936 王妃.zip
Archive: 王妃.zip
inflating: 王妃_A.ust
inflating: 王妃_B.ust

$ unzip -O cp950 王妃.zip
Archive: 王妃.zip
inflating: 卼漦_A.ust
inflating: 卼漦_B.ust

Actually, not all the wrong cases map to illegal UTF-8 strings (question marks). I guess that is why auto-detection is not so straightforward?


Nobuto Murata (nobuto) wrote on 2015-03-10:

@Yuan,

Take the "王妃.zip" you posted as an example: the file names in the archive are short. Even unar/lsar fails to detect the encoding (you expect CP936, but lsar reports ISO-8859-8). Encoding auto-detection is not 100% reliable, especially with short file names, which give the detector fewer hints.
====
$ lsar -l -pe 王妃.zip
王妃.zip: Zip
     Flags File size Ratio Mode Date Time Name
     ===== ========== ===== ==== ========== ===== ====
  0. ----- 40344 82.9% Defl 2014-10-03 13:40 %cd%f5%e5%fa_A.ust
  1. ----- 20311 80.4% Defl 2014-10-03 13:40 %cd%f5%e5%fa_B.ust
(Flags: D=Directory, R=Resource fork, L=Link, E=Encrypted, @=Extended attributes)
(Mode: Defl=Deflate)
Encoding: ISO-8859-8 (76% confidence)
====
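
As a rough illustration of why short names defeat detection, the sketch below feeds the four name bytes from the lsar listing to `iconv` under several candidate code pages and prints every candidate that decodes without error. The helper name `try_encodings` is hypothetical, and which candidates succeed depends on the local iconv implementation.

```shell
# Try each candidate encoding on the same bytes and list the ones that decode cleanly.
# $1: bytes written as printf octal escapes; remaining arguments: candidate encodings
try_encodings() {
    bytes=$1
    shift
    for enc in "$@"; do
        # The pipeline's status is iconv's, so a failed decode is skipped.
        if out=$(printf "$bytes" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%s\t%s\n' "$enc" "$out"
        fi
    done
}

# 0xCD 0xF5 0xE5 0xFA, shown by lsar as %cd%f5%e5%fa
try_encodings '\315\365\345\372' CP932 CP936 CP949 CP950
```

When more than one candidate decodes cleanly, validity alone cannot pick the right one, and four bytes leave little for statistical detection to work with.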

Anyway, enabling auto-detection or specifying the encoding in file-roller is out of scope for this bug report. Please open separate bugs if needed. I would like to proceed with the fallback setting in the attached branch for vivid.

Ikuya Awashiro (ikuya-fruitsbasket) wrote on 2015-03-22:

#10

Any progress?

Sebastien Bacher (seb128) wrote on 2015-05-18:

#11

It seems there are no Ubuntu developers who feel like reviewing those changes; it would be good to get them reviewed upstream and/or in Debian...

Nobuto Murata (nobuto) wrote on 2015-07-08:

#12

I have sent an enhancement request to upstream through http://www.info-zip.org/zip-bug.html since the issue is still reproducible with 6.1c19-BETA which you can try from: https://launchpad.net/~nobuto/+archive/ubuntu/build-test/+build/7630500

Putting a copy of the request here for your reference.

====

This is an enhancement request. Thanks to ICONV_MAPPING (the -O/-I options), we can specify the character encoding when extracting zip files. However, when unzip is driven by a GUI application (e.g. file-roller on Linux), the user has no way to specify -I/-O. Therefore we cannot properly extract zip files created on localized Windows systems from the GUI.

A workaround would be exporting UNZIP and ZIPINFO variables with "-O <local charset on Windows>" per locale on login by putting [1] under /etc/profile.d/.

[1] http://bazaar.launchpad.net/~nobuto/ubuntu/vivid/unzip/fallback-encoding/view/head:/debian/unzip-default-charset.sh

It would be nice if unzip had fallback charset mapping per locale out of the box. I have created a test case to handle 3 types of zip files in ja_JP locale.

[2] http://bazaar.launchpad.net/~nobuto/ubuntu/vivid/unzip/fallback-encoding/view/head:/debian/tests/fallback-encoding
(without [1], 3rd test case, fat and CP932, will fail.)


$ unzip -v 
UnZip 6.1c19-BETA (2015-04-15) by Info-ZIP.  Maintainer: Steven M. Schweda
 Copyright (c) 1990-2015 Info-ZIP.  For software license: unzip --license
 See README for details.  More info: http://info-zip.org/UnZip.html

Compiled with GCC 4.9.2 for Unix (GNU/Linux x86_64).

UnZip special compilation options:
        ARCHIVE_STDIN        (Allow streaming archive from stdin)
        ICONV_MAPPING        (ISO/OEM (iconv, -I/-O) conversion supported)
        IZ_HAVE_UXUIDGID     (UID, GID > 16-bit ("ux" extra block) supported)
        SET_DIR_ATTRIB       (Setting directory attributes supported)
        SYMLINKS             (Symbolic links supported, if RTL and file sys do)
        TIMESTAMP            (Restoring file timestamps supported)
        UNIXBACKUP           (-B creates backup files)
        USE_EF_UT_TIME       (Use Universal Time, if available)
        UNSHRINK_SUPPORT     (PKZIP/Zip 1.x Shrink compression)
        DEFLATE64_SUPPORT    (PKZIP 4.x Deflate64(tm) compression)
        UNICODE_SUPPORT [wide-chars, char coding: UTF-8] (handle UTF-8 paths)
        MBCS-support         (Multibyte character support, MB_CUR_MAX = 6)
        LARGE_FILE_SUPPORT   (Large files over 2 GiB supported)
        ZIP64_SUPPORT        (Archives using Zip64 for large files supported)
        BZIP2_SUPPORT        (PKZIP 4.6+, bzip2 lib ver 1.0.6, 6-Sept-2010)
        LZMA_SUPPORT         (PKZIP 6.3+, LZMA compression, ver 9.20)
        PPMD_SUPPORT         (PKZIP 6.3+, PPMd compression, ver 9.20)
        VMS_TEXT_CONV        (Conversion of VMS var-len rec fmt text supported)
        IZ_CRYPT_TRAD        (Traditional (weak) encryption, ver 3.0)

Traditional Zip Encryption notice:
        The traditional zip encryption code of this program is not
        copyrighted, and is put in the public domain.  It was originally
        written in Europe, and, to the best of our knowledge, can be freely
        distributed in both source and object forms from any country,
        including the USA under License Exception TSU of the U.S. Export
        Administration Regulations (section 740.13(e)) of 6 June 2002.

UnZip and ZipInfo environment options:
           UNZIP:  [none]
        UNZIPOPT:  [none]
         ZIPINFO:  [none]
      ZIPINFOOPT:  [none]

Iain Lane (laney) wrote on 2015-09-02:

#13

Did upstream say anything?

What is "GBK" that Kylin uses and why is it different from the one we have here?

Sorry for being clueless. :)

Aron Xu (happyaron) wrote on 2015-09-04:

#14

Upstream won't adopt such behavior, as they regard it as a locale hack.

GBK is a superset of cp936, but it is not so large that it overlaps with portions of UTF-8 (so it can be detected reliably, unlike GB18030). From this point of view it is better to use GBK than cp936.

Nobuto Murata (nobuto) wrote on 2015-09-04:

#15

> Did upstream say anything?

I've got a reply from an unzip developer, but he is also not familiar with these charset issues. I need to discuss it further upstream. However, what I'm trying to do here is a relatively short-term solution. I believe the request in the attached branch is still valid downstream, as a workaround for a real-life problem that users of non-Latin charsets see in daily use.

> What is "GBK" that Kylin uses and why is it different from the one we have here?

I took the charset list from:
https://msdn.microsoft.com/en-us/goglobal/bb964654
and
msdos/msdos.c in unzip package:
   case 932: /* Japanese */
   case 949: /* Korean */
   case 936: /* Chinese, simple */
   case 950: /* Chinese, traditional */
   case 874: /* Thai */
   case 1258: /* Vietnamese */

I'm not so familiar with Chinese charsets. I thought CP936 was suitable because we are trying to solve the issue with zip files made on localized Windows. GBK may have wider coverage than CP936, though.

Steve Langasek (vorlon) wrote on 2015-09-04:

#16

Followed up on https://code.launchpad.net/~nobuto/ubuntu/wily/unzip/fallback-encoding/+merge/268850 with some feedback about the patch. There are better ways to achieve this than through profile.d.

Alberto Salvia Novella (es20490446e) wrote on 2015-09-18:

#17

Bug #1462848 could be a duplicate.

Unxed (unxed) wrote on 2020-06-23:

#18

Wrote a patch for unzip fixing this issue:
https://sourceforge.net/p/infozip/patches/29/

The same patch for p7zip:
https://sourceforge.net/p/p7zip/bugs/187/

Bug Watch Updater (bug-watch-updater) on 2021-08-21

Changed in unzip (Debian):
status:	Confirmed → Fix Released

Remote bug watches

debbugs #483290
[done normal upstream patch]
