Merge lp:~saviq/elisa/subtitle_chardet into lp:elisa

Proposed by Michał Sawicz
Status: Needs review
Proposed branch: lp:~saviq/elisa/subtitle_chardet
Merge into: lp:elisa
Diff against target: 123 lines (+70/-4)
3 files modified
elisa-core/README (+1/-0)
elisa-plugins/elisa/plugins/poblesec/player_video.py (+68/-4)
elisa-plugins/elisa/plugins/poblesec/setup.py (+1/-0)
To merge this branch: bzr merge lp:~saviq/elisa/subtitle_chardet
Reviewer: Olivier Tilloy (status: Needs Fixing)
Review via email: mp+25349@code.launchpad.net

Description of the change

This bundle implements three ways of subtitle encoding detection, in
order:
a) user-defined list of encodings
b) chardet for automatic encoding detection
c) i18n-defined list of encodings

The user can define a list of encodings to try in the config file; the
first encoding that successfully loads the file will be used.

On initial install a) won't be used because the default encoding list is
empty. Automatic detection is done by the python-chardet module [1].
Currently chardet is used in try ... except blocks, so it is not a hard
dependency, although installing it is strongly encouraged.

If a) is not used or fails, and b) fails or is less confident of its
findings than 0.9 on a [0, 1] scale, c) is tried: a list of encodings
defined by the translator for the current locale is used, just as in a).
If this fails too, the encoding detected in b) is used.
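The decision order described above can be sketched as follows. All names here are illustrative, not the actual Moovida code; `detect` stands in for an optional chardet-style detector returning an (encoding, confidence) pair:

```python
def pick_subtitle_encoding(data, user_encodings, i18n_encodings,
                           detect=None, threshold=0.9):
    """Sketch of the a) -> b) -> c) fallback chain for subtitle bytes."""
    def first_that_decodes(encodings):
        # Return the first encoding that strictly decodes the data.
        for enc in encodings:
            try:
                data.decode(enc, 'strict')
            except (UnicodeDecodeError, LookupError):
                continue
            return enc
        return None

    # a) user-defined list of encodings from the config file
    encoding = first_that_decodes(user_encodings)
    if encoding:
        return encoding

    # b) automatic detection, trusted outright above the threshold
    detected, confidence = detect(data) if detect else (None, 0.0)
    if detected and confidence >= threshold:
        return detected

    # c) i18n-defined list for the current locale
    encoding = first_that_decodes(i18n_encodings)
    if encoding:
        return encoding

    # last resort: whatever b) found, even at low confidence
    return detected
```

The threshold parameter corresponds to the 0.9 confidence cut-off mentioned above.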

The careful reviewer will see that this bundle does not introduce any
regressions - all subtitles are loaded as usual. The routines
implemented in this bundle can be tested as follows:

* on initial install, without user- or i18n-defined encodings and
without python-chardet installed, subtitles will be loaded with the
default gstreamer locale. Then config-file support should be tried, both
with a correct list of encodings and with one that will fail (e.g.
['ascii']). In both cases the subtitles will load, but will be displayed
correctly only in the first case;
* after updating the translation template (setup.py pot_update) and the
catalog file for your preferred locale (setup.py update_catalog), and
after a subsequent build of the catalogs (setup.py build_po), the two
previous tests for user-defined encodings should be repeated. This time
the failing example should be corrected by i18n support, as long as the
right encoding was added to the language catalog;
* it is now time to install python-chardet (packaged for most major
distros) and run the tests again. Empty the config encodings list and
remove the compiled catalog files (*.mo); the encoding should still be
detected properly and the subtitles displayed correctly. The remaining
issues are mostly with differentiating WINDOWS-1250 from ISO-8859-2
(Central European).

IMPORTANT:
Applying this bundle should be followed by adding python-chardet [1] to
the Windows build.

[1] http://chardet.feedparser.org/

Revision history for this message
Olivier Tilloy (osomon) wrote :

Thanks for this cool patch!

I've played with it and looked at it carefully, and overall it looks very good to me, a nice improvement in subtitles handling.

My functional tests were mainly focused on using chardet to automagically detect the encoding, and I used various subtitle files in the following languages: Arabic, Czech, Danish, French, Icelandic, Russian, Turkish, Vietnamese (of course I'm fluent in all of them, why d'you ask?).

On my setup, chardet does a better job for some languages (Russian for example), but it also introduces a pretty bad regression for French (the file I tested is obviously encoded in ISO-8859, but chardet detects it as windows-1255, with a confidence of 0.99). Consequently, even if I localize the standard encodings string, the wrong encoding will be used (because of the order methods are tried in). The only way to work around this issue is to have "ISO-8859" listed as my default encoding for subtitles in the configuration. The problem is, this value isn't there by default. How about defining a dictionary of well-known default encodings for given locales, and using it to populate the subtitle_encoding setting? E.g.: {'fr_FR': ['ISO-8859']} ...
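A minimal sketch of the locale-keyed defaults suggested above; the dictionary contents and the helper are hypothetical, not existing Moovida configuration (and 'ISO-8859-1' is used as a concrete stand-in for the "ISO-8859" family):

```python
# Hypothetical table of well-known subtitle encodings per locale,
# used only to pre-populate the subtitle_encoding setting.
DEFAULT_SUBTITLE_ENCODINGS = {
    'fr_FR': ['ISO-8859-1'],
    'pl_PL': ['ISO-8859-2', 'windows-1250'],
    'ru_RU': ['koi8-r', 'windows-1251'],
}

def default_encodings_for(locale_name):
    # Unknown locales fall back to an empty list, i.e. current behaviour.
    return DEFAULT_SUBTITLE_ENCODINGS.get(locale_name, [])
```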

I realize it's yet another clunky solution on top of an already complex process to get the subtitles right, but I wouldn't want to see regressions introduced, especially of that kind.
Any criticism is welcome, and I'm happy to discuss/review alternative solutions.

And finally, a tiny remark on the style of the code itself. In the "try... except NameError" block, you enclosed a lot of code. As I understand it, the only statement that may raise a NameError is "chardet = UniversalDetector()", so it would be better to leave the rest out of the "try... except" block (use e.g. an "else" clause).
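The suggested restructuring, sketched on a reduced version of the block (the optional import mirrors the patch; feeding an iterable of lines replaces the file handling for brevity):

```python
# Optional dependency, exactly as in the patch: on ImportError the
# name UniversalDetector is simply left unbound.
try:
    from chardet.universaldetector import UniversalDetector
except ImportError:
    pass

def detect_encoding(lines):
    # Only the constructor call can raise NameError (when the import
    # above failed), so it alone sits in the try; the detection loop
    # moves to the else clause.
    try:
        detector = UniversalDetector()
    except NameError:
        return None, None
    else:
        for line in lines:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding'], detector.result['confidence']
```

With chardet installed this returns the detected (encoding, confidence); without it, (None, None), preserving the soft-dependency behaviour.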

review: Needs Fixing
Revision history for this message
Michał Sawicz (saviq) wrote :

Dnia 2010-05-20, czw o godzinie 18:19 +0000, Olivier Tilloy pisze:
> On my setup, chardet does a better job for some languages (Russian for
> example), but it also introduces a pretty bad regression for French
> (the file I tested is obviously encoded in ISO-8859, but chardet
> detects it as windows-1255, with a confidence of 0.99). Consequently,
> even if I localize the standard encodings string, the wrong encoding
> will be used (because of the order methods are tried in). The only way
> to work around this issue is to have "ISO-8859" listed as my default
> encoding for subtitles in the configuration. The problem is, this
> value isn't there by default. How about defining a dictionary of
> well-known default encodings for given locales, and using it to
> populate the subtitle_encoding setting? E.g.: {'fr_FR':
> ['ISO-8859']} ...
>
> I realize it's yet another clunky solution on top of an already
> complex process to get the subtitles right, but I wouldn't want to see
> regressions introduced, especially of that kind.
> Any criticism is welcome, and I'm happy to discuss/review alternative
> solutions.

That's (almost) exactly what I've done through i18n. There's usually no
way to know what language the subtitles are in; we only know the locale,
and we can assume the subtitles are in that language. Maybe the list
defined by translators should have precedence over chardet?

It's (almost) the same with ISO-8859-2 and CP1250, which are both
charsets for Polish, but IIRC chardet reported a much lower confidence -
hence the 'confidence < 0.9' condition.

I will try and stress test chardet for your case and will report back.
Of course we can't push it with such a regression.

> And finally, a tiny remark on the style of the code itself. In the
> "try... except NameError" block, you enclosed a lot of code. As I
> understand it, the only statement that may raise a NameError is
> "chardet = UniversalDetector()", so it would be better to leave the
> rest out of the "try... except" block (use e.g. an "else" clause).

I think when I wrote this patch I wasn't yet aware of the "else" clause
for "try... except" :).

Thanks for the review, hope we can get it right. I even think this is
something that should be fixed in chardet, will try and get in touch
with them to see what can be done.

--
Michał Sawicz <email address hidden>

Revision history for this message
Olivier Tilloy (osomon) wrote :

On 2010-05-20, Michał Sawicz <email address hidden> wrote:
> Dnia 2010-05-20, czw o godzinie 18:19 +0000, Olivier Tilloy pisze:
>> On my setup, chardet does a better job for some languages (Russian for
>> example), but it also introduces a pretty bad regression for French
>> (the file I tested is obviously encoded in ISO-8859, but chardet
>> detects it as windows-1255, with a confidence of 0.99). Consequently,
>> even if I localize the standard encodings string, the wrong encoding
>> will be used (because of the order methods are tried in). The only way
>> to work around this issue is to have "ISO-8859" listed as my default
>> encoding for subtitles in the configuration. The problem is, this
>> value isn't there by default. How about defining a dictionary of
>> well-known default encodings for given locales, and using it to
>> populate the subtitle_encoding setting? E.g.: {'fr_FR':
>> ['ISO-8859']} ...
>>
>> I realize it's yet another clunky solution on top of an already
>> complex process to get the subtitles right, but I wouldn't want to see
>> regressions introduced, especially of that kind.
>> Any criticism is welcome, and I'm happy to discuss/review alternative
>> solutions.
>
> That's (almost) exactly what I've done through i18n. There's usually no
> way to know what language the subtitles are in, we only know the locale
> and we can assume that subtitles are in this language. Maybe the list
> defined by translators should have precedence over chardet?

That's one possible solution, yes. Needs testing.

> It's (almost) the same with ISO-8859-2 and CP1250, which are both
> charsets for Polish, but IIRC chardet reported a much lower confidence -
> hence the 'confidence < 0.9' condition.

The condition looks good to me, the problem is false positives.

> I will try and stress test chardet for your case and will report back.
> Of course we can't push it with such a regression.
>
>> And finally, a tiny remark on the style of the code itself. In the
>> "try... except NameError" block, you enclosed a lot of code. As I
>> understand it, the only statement that may raise a NameError is
>> "chardet = UniversalDetector()", so it would be better to leave the
>> rest out of the "try... except" block (use e.g. an "else" clause).
>
> I think when I wrote this patch I wasn't yet aware of the "else" clause
> for "try... except" :).
>
> Thanks for the review, hope we can get it right. I even think this is
> something that should be fixed in chardet, will try and get in touch
> with them to see what can be done.

Definitely the way to go, if there is a way to get this fixed in
chardet, it's even better. Note that I can provide the subtitle files I
used to test on demand.

Unmerged revisions

1616. By Michał Sawicz

Add automatic subtitle encoding detection

Preview Diff

=== modified file 'elisa-core/README'
--- elisa-core/README	2009-11-09 15:44:51 +0000
+++ elisa-core/README	2010-05-14 17:46:28 +0000
@@ -32,6 +32,7 @@
 - python-pgm 0.3.12 https://core.fluendo.com/pigment/trac
 - python-simplejson
 - python-cssutils
+- python-chardet

 * for the development version of moovida, you should always use
   pigment-svn/python-pigment-svn

=== modified file 'elisa-plugins/elisa/plugins/poblesec/player_video.py'
--- elisa-plugins/elisa/plugins/poblesec/player_video.py	2010-01-28 17:18:54 +0000
+++ elisa-plugins/elisa/plugins/poblesec/player_video.py	2010-05-14 17:46:28 +0000
@@ -54,6 +54,11 @@

 import platform
 from urllib import quote
+# try loading chardet to detect subtitle encoding
+try:
+    from chardet.universaldetector import UniversalDetector
+except ImportError:
+    pass
 try:
     from elisa.plugins.database.models import File as DBFile
 except ImportError:
@@ -711,6 +716,60 @@
             quoted_sub_uri = quote(sub_uri, ':/')
             self.pipeline.set_property('suburi', quoted_sub_uri)
             self.info("Loaded subtitles at %r", quoted_sub_uri)
+
+            def try_encodings(encoding_list, file_path):
+                for encoding in encoding_list:
+                    try:
+                        file = open(file_path)
+                        unicode(file.read(), encoding, 'strict')
+                    except UnicodeDecodeError:
+                        self.debug('Failed encoding %s', encoding)
+                        continue
+                    finally:
+                        file.close()
+                    return encoding
+                return None
+
+            encoding = None
+            confidence = None
+
+            # try custom encodings list
+            if self.config.get('subtitle_encoding'):
+                encoding = \
+                    try_encodings(self.config.get('subtitle_encoding'), \
+                                  sub_file)
+            if not encoding:
+                # try using chardet for subtitle charset detection
+                try:
+                    file = open(sub_file)
+                    chardet = UniversalDetector()
+                    for line in file:
+                        chardet.feed(line)
+                        if chardet.done:
+                            break
+                    chardet.close()
+                    if chardet.result['encoding']:
+                        encoding = chardet.result['encoding']
+                        confidence = chardet.result['confidence']
+                        self.info("Detected encoding %s with %f confidence", \
+                                  chardet.result['encoding'], \
+                                  chardet.result['confidence'])
+                except NameError:
+                    pass
+                finally:
+                    file.close()
+            # try encoding list defined by translators
+            if not encoding or (confidence is not None and confidence < 0.9):
+                # NOTE: a space separated list of possible subtitle
+                # encodings for your language
+                i18n_encoding_list = _('ascii utf8').split(' ')
+                i18n_encoding = \
+                    try_encodings(i18n_encoding_list, sub_file)
+                if i18n_encoding:
+                    encoding = i18n_encoding
+            if encoding:
+                self.pipeline.set_property('subtitle-encoding', encoding)
+                self.info("Using encoding %s", encoding)
             found = True
             break
         if not found:
@@ -1690,13 +1749,18 @@
     default_config = {'audio_sink' : get_default_audio_sink(),
                       'subtitle_font' : 'Liberation Sans',
                       'subtitle_font_size_factor': '0.0612',
+                      'subtitle_encoding' : []
                       }

-    config_doc = {'subtitle_font' : 'Font name used for subtitles',
+    config_doc = {'audio_sink' : 'give the name of the gstreamer audio sink ' \
+                                 'elisa should use. If not set autoaudiosink '\
+                                 'will be used.',
+                  'subtitle_font' : 'Font name used for subtitles',
                   'subtitle_font_size_factor': 'Factor multiplied by video '\
                                                'height to obtain subtitle '\
                                                'font size',
-                  'audio_sink' : 'give the name of the gstreamer audio sink' \
-                                 'elisa should use. If not set autoaudiosink'\
-                                 'will be used.',
+                  'subtitle_encoding' : 'A list of character encodings to ' \
+                                        'try in order. If not set, automatic ' \
+                                        'detection and i18n-defined encodings ' \
+                                        'will be tried'
                   }

=== modified file 'elisa-plugins/elisa/plugins/poblesec/setup.py'
--- elisa-plugins/elisa/plugins/poblesec/setup.py	2010-01-29 11:05:05 +0000
+++ elisa-plugins/elisa/plugins/poblesec/setup.py	2010-05-14 17:46:28 +0000
@@ -12,6 +12,7 @@
     license='GPL3',
     author='Moovida Developers',
     author_email='elisa@lists.fluendo.com',
+    install_requires=['chardet'],
     namespace_packages=['elisa', 'elisa.plugins'],
     packages=packages,
     package_dir=package_dir,
