Merge lp:~saviq/elisa/subtitle_chardet into lp:elisa

Proposed by Michał Sawicz
Status: Needs review
Proposed branch: lp:~saviq/elisa/subtitle_chardet
Merge into: lp:elisa
Diff against target: 123 lines (+70/-4)
3 files modified
elisa-core/README (+1/-0)
elisa-plugins/elisa/plugins/poblesec/player_video.py (+68/-4)
elisa-plugins/elisa/plugins/poblesec/setup.py (+1/-0)
To merge this branch: bzr merge lp:~saviq/elisa/subtitle_chardet
Reviewer Review Type Date Requested Status
Olivier Tilloy Needs Fixing
Review via email: mp+25349@code.launchpad.net

Description of the change

This bundle implements three ways of subtitle encoding detection, in
order:
a) user-defined list of encodings
b) chardet for automatic encoding detection
c) i18n-defined list of encodings

The user can define a list of encodings to try in the config file, first
encoding that will succesfully load the file will be used.

On initial install a) won't be used because the default encoding list is
empty. Automatic detection is done by python-chardet module [1].
Currently chardet is used in try ... except blocks, so it's not a hard
dependency, although it's much encouraged.

If a) is not used or fails and b) fails or is less confident of it's
findings than 0.9 on a [0, 1] scale, c) is tried - a list of encodings
defined by the translator for current locale is used just as in a). If
this fails, the encoding detected in b) is used.

The careful reviewer will see that this bundle does not introduce any
regressions - all subtitles are loaded as usual. The routines
implemented in this bundle can be tested as follows:

* on initial install without user- or i18n- defined encoding and no
python-chardet installed, subtitles will be loaded with default
gstreamer locale. Then config-file support should be tried, both with
correct list of encodings and one that will fail (i.e. ['ascii']). In
both cases the subtitles will load, but will be displayed correctly only
in the first case;
* after updating the translation template (setup.py pot_update) file and
catalog file for your preferred locale (setup.py update_catalog) and
subsequent build of the catalogs (setup.py build_po), the two previous
tests for user-defined encodings should be repeated. In this case the
failing example should be corrected by i18n support as long as the right
encoding was added in the language catalog;
* it's now time to install python-chardet (packaged for most major
distros) and run your tests again. Empty the config encodings list and
remove the compiled catalog files (*.mo) and the encoding should still
be detected properly and the subtitles displayed correctly. There are
mostly issues with differentiating WINDOWS-1250 from ISO-8859-2 (Central
European).

IMPORTANT:
Applying this bundle should be followed by adding python-chardet [1] to
the windows build.

[1] http://chardet.feedparser.org/

To post a comment you must log in.
Revision history for this message
Olivier Tilloy (osomon) wrote :

Thanks for this cool patch!

I've played with it and looked at it carefully, and overall it looks very good to me, a nice improvement in subtitles handling.

My functional tests were mainly focused on the use of chardet to automagically detected the encodings, and I used various subtitle files in the following languages: Arabic, Czech, Danish, French, Icelandic, Russian, Turkish, Vietnamese (of course I'm fluent in all of them, why d'you ask?).

On my setup, chardet does a better job for some languages (Russian for example), but it also introduces a pretty bad regression for French (the file I tested is obviously encoded in ISO-8859, but chardet detects it as windows-1255, with a confidence of 0.99). Consequently, even if I localize the standard encodings string, the wrong encoding will be used (because of the order methods are tried in). The only way to work around this issue is to have "ISO-8859" listed as my default encoding for subtitles in the configuration. The problem is, this value isn't there by default. How about defining a dictionary of well-known default encodings for given locales, and using it to populate the subtitle_encoding setting? E.g.: {'fr_FR': ['ISO-8859']} ...

I realize it's yet another clanky solution on top of an already complex process to get the subtitles right, but I wouldn't want to see regressions introduced, especially of that kind.
Any criticism is welcome, and I'm happy to discuss/review alternative solutions.

And finally, a tiny remark on the style of the code itself. In the "try... except NameError" block, you enclosed a lot of code. As I understand it, the only statement that may raise a NameError is "chardet = UniversalDetector()", so it would be better to leave the rest out of the "try... except" block (use e.g. an "else" clause).

review: Needs Fixing
Revision history for this message
Michał Sawicz (saviq) wrote :

Dnia 2010-05-20, czw o godzinie 18:19 +0000, Olivier Tilloy pisze:
> On my setup, chardet does a better job for some languages (Russian for
> example), but it also introduces a pretty bad regression for French
> (the file I tested is obviously encoded in ISO-8859, but chardet
> detects it as windows-1255, with a confidence of 0.99). Consequently,
> even if I localize the standard encodings string, the wrong encoding
> will be used (because of the order methods are tried in). The only way
> to work around this issue is to have "ISO-8859" listed as my default
> encoding for subtitles in the configuration. The problem is, this
> value isn't there by default. How about defining a dictionary of
> well-known default encodings for given locales, and using it to
> populate the subtitle_encoding setting? E.g.: {'fr_FR':
> ['ISO-8859']} ...
>
> I realize it's yet another clanky solution on top of an already
> complex process to get the subtitles right, but I wouldn't want to see
> regressions introduced, especially of that kind.
> Any criticism is welcome, and I'm happy to discuss/review alternative
> solutions.

That's (almost) exactly what I've done through i18n. There's usually no
way to know what language the subtitles are in, we only know the locale
and we can assume that subtitles are in this language. Maybe the list
defined by translators should have precedence over chardet?

It's (almost) the same with ISO-8859-2 and CP1250, which are both
charsets for Polish, but IIRC chardet reported a much lower confidence -
hence the 'confidence < 0.9' condition.

I will try and stress test chardet for your case and will report back.
Of course we can't push it with such a regression.

> And finally, a tiny remark on the style of the code itself. In the
> "try... except NameError" block, you enclosed a lot of code. As I
> understand it, the only statement that may raise a NameError is
> "chardet = UniversalDetector()", so it would be better to leave the
> rest out of the "try... except" block (use e.g. an "else" clause).

I think when I wrote this patch I wasn't yet aware of the "else" clause
for "try... except" :).

Thanks for the review, hope we can get it right. I even think this is
something that should be fixed in chardet, will try and get in touch
with to see what can be done.

--
Michał Sawicz <email address hidden>

Revision history for this message
Olivier Tilloy (osomon) wrote :

On 2010-05-20, Michał Sawicz <email address hidden> wrote:
> Dnia 2010-05-20, czw o godzinie 18:19 +0000, Olivier Tilloy pisze:
>> On my setup, chardet does a better job for some languages (Russian for
>> example), but it also introduces a pretty bad regression for French
>> (the file I tested is obviously encoded in ISO-8859, but chardet
>> detects it as windows-1255, with a confidence of 0.99). Consequently,
>> even if I localize the standard encodings string, the wrong encoding
>> will be used (because of the order methods are tried in). The only way
>> to work around this issue is to have "ISO-8859" listed as my default
>> encoding for subtitles in the configuration. The problem is, this
>> value isn't there by default. How about defining a dictionary of
>> well-known default encodings for given locales, and using it to
>> populate the subtitle_encoding setting? E.g.: {'fr_FR':
>> ['ISO-8859']} ...
>>
>> I realize it's yet another clanky solution on top of an already
>> complex process to get the subtitles right, but I wouldn't want to see
>> regressions introduced, especially of that kind.
>> Any criticism is welcome, and I'm happy to discuss/review alternative
>> solutions.
>
> That's (almost) exactly what I've done through i18n. There's usually no
> way to know what language the subtitles are in, we only know the locale
> and we can assume that subtitles are in this language. Maybe the list
> defined by translators should have precedence over chardet?

That's one possible solution, yes. Needs testing.

> It's (almost) the same with ISO-8859-2 and CP1250, which are both
> charsets for Polish, but IIRC chardet reported a much lower confidence -
> hence the 'confidence < 0.9' condition.

The condition looks good to me, the problem is false positives.

> I will try and stress test chardet for your case and will report back.
> Of course we can't push it with such a regression.
>
>> And finally, a tiny remark on the style of the code itself. In the
>> "try... except NameError" block, you enclosed a lot of code. As I
>> understand it, the only statement that may raise a NameError is
>> "chardet = UniversalDetector()", so it would be better to leave the
>> rest out of the "try... except" block (use e.g. an "else" clause).
>
> I think when I wrote this patch I wasn't yet aware of the "else" clause
> for "try... except" :).
>
> Thanks for the review, hope we can get it right. I even think this is
> something that should be fixed in chardet, will try and get in touch
> with to see what can be done.

Definitely the way to go, if there is a way to get this fixed in
chardet, it's even better. Note that I can provide the subtitle files I
used to test on demand.

Unmerged revisions

1616. By Michał Sawicz

Add automatic subtitle encoding detection

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'elisa-core/README'
--- elisa-core/README 2009-11-09 15:44:51 +0000
+++ elisa-core/README 2010-05-14 17:46:28 +0000
@@ -32,6 +32,7 @@
32- python-pgm 0.3.12 https://core.fluendo.com/pigment/trac32- python-pgm 0.3.12 https://core.fluendo.com/pigment/trac
33- python-simplejson33- python-simplejson
34- python-cssutils34- python-cssutils
35- python-chardet
3536
36 * for the development version of moovida, you should always use37 * for the development version of moovida, you should always use
37 pigment-svn/python-pigment-svn38 pigment-svn/python-pigment-svn
3839
=== modified file 'elisa-plugins/elisa/plugins/poblesec/player_video.py'
--- elisa-plugins/elisa/plugins/poblesec/player_video.py 2010-01-28 17:18:54 +0000
+++ elisa-plugins/elisa/plugins/poblesec/player_video.py 2010-05-14 17:46:28 +0000
@@ -54,6 +54,11 @@
5454
55import platform55import platform
56from urllib import quote56from urllib import quote
57# try loading chardet to detect subtitle encoding
58try:
59 from chardet.universaldetector import UniversalDetector
60except ImportError:
61 pass
57try:62try:
58 from elisa.plugins.database.models import File as DBFile63 from elisa.plugins.database.models import File as DBFile
59except ImportError:64except ImportError:
@@ -711,6 +716,60 @@
711 quoted_sub_uri = quote(sub_uri, ':/')716 quoted_sub_uri = quote(sub_uri, ':/')
712 self.pipeline.set_property('suburi', quoted_sub_uri)717 self.pipeline.set_property('suburi', quoted_sub_uri)
713 self.info("Loaded subtitles at %r", quoted_sub_uri)718 self.info("Loaded subtitles at %r", quoted_sub_uri)
719
720 def try_encodings(encoding_list, file_path):
721 for encoding in encoding_list:
722 try:
723 file = open(file_path)
724 unicode(file.read(), encoding, 'strict')
725 except UnicodeDecodeError:
726 self.debug('Failed encoding %s', encoding)
727 continue
728 finally:
729 file.close()
730 return encoding
731 return None
732
733 encoding = None
734 confidence = None
735
736 # try custom encodings list
737 if self.config.get('subtitle_encoding'):
738 encoding = \
739 try_encodings(self.config.get('subtitle_encoding'), \
740 sub_file)
741 if not encoding:
742 # try using chardet for subtitle charset detection
743 try:
744 file = open(sub_file)
745 chardet = UniversalDetector()
746 for line in file:
747 chardet.feed(line)
748 if chardet.done:
749 break
750 chardet.close()
751 if chardet.result['encoding']:
752 encoding = chardet.result['encoding']
753 confidence = chardet.result['confidence']
754 self.info("Detected encoding %s with %f confidence", \
755 chardet.result['encoding'], \
756 chardet.result['confidence'])
757 except NameError:
758 pass
759 finally:
760 file.close()
761 # try encoding list defined by translators
762 if not encoding or (confidence is not None and confidence < 0.9):
763 # NOTE: a space separated list of possible subtitle
764 # encodings for your language
765 i18n_encoding_list = _('ascii utf8').split(' ')
766 i18n_encoding = \
767 try_encodings(i18n_encoding_list, sub_file)
768 if i18n_encoding:
769 encoding = i18n_encoding
770 if encoding:
771 self.pipeline.set_property('subtitle-encoding', encoding)
772 self.info("Using encoding %s", encoding)
714 found = True773 found = True
715 break774 break
716 if not found:775 if not found:
@@ -1690,13 +1749,18 @@
1690 default_config = {'audio_sink' : get_default_audio_sink(),1749 default_config = {'audio_sink' : get_default_audio_sink(),
1691 'subtitle_font' : 'Liberation Sans',1750 'subtitle_font' : 'Liberation Sans',
1692 'subtitle_font_size_factor': '0.0612',1751 'subtitle_font_size_factor': '0.0612',
1752 'subtitle_encoding' : []
1693 }1753 }
16941754
1695 config_doc = {'subtitle_font' : 'Font name used for subtitles',1755 config_doc = {'audio_sink' : 'give the name of the gstreamer audio sink ' \
1756 'elisa should use. If not set autoaudiosink '\
1757 'will be used.',
1758 'subtitle_font' : 'Font name used for subtitles',
1696 'subtitle_font_size_factor': 'Factor multiplied by video '\1759 'subtitle_font_size_factor': 'Factor multiplied by video '\
1697 'height to obtain subtitle '\1760 'height to obtain subtitle '\
1698 'font size',1761 'font size',
1699 'audio_sink' : 'give the name of the gstreamer audio sink' \1762 'subtitle_encoding' : 'A list of character encodings to ' \
1700 'elisa should use. If not set autoaudiosink'\1763 'try in order. If not set, automatic ' \
1701 'will be used.',1764 'detection and i18n-defined encodings ' \
1765 'will be tried'
1702 }1766 }
17031767
=== modified file 'elisa-plugins/elisa/plugins/poblesec/setup.py'
--- elisa-plugins/elisa/plugins/poblesec/setup.py 2010-01-29 11:05:05 +0000
+++ elisa-plugins/elisa/plugins/poblesec/setup.py 2010-05-14 17:46:28 +0000
@@ -12,6 +12,7 @@
12 license='GPL3',12 license='GPL3',
13 author='Moovida Developers',13 author='Moovida Developers',
14 author_email='elisa@lists.fluendo.com',14 author_email='elisa@lists.fluendo.com',
15 install_requires=['chardet'],
15 namespace_packages=['elisa', 'elisa.plugins'],16 namespace_packages=['elisa', 'elisa.plugins'],
16 packages=packages,17 packages=packages,
17 package_dir=package_dir,18 package_dir=package_dir,

Subscribers

People subscribed via source and target branches