Merge lp:~saviq/elisa/subtitle_chardet into lp:elisa

Proposed by Michał Sawicz
Status: Needs review
Proposed branch: lp:~saviq/elisa/subtitle_chardet
Merge into: lp:elisa
Diff against target: 123 lines (+70/-4)
3 files modified
elisa-core/README (+1/-0)
elisa-plugins/elisa/plugins/poblesec/player_video.py (+68/-4)
elisa-plugins/elisa/plugins/poblesec/setup.py (+1/-0)
To merge this branch: bzr merge lp:~saviq/elisa/subtitle_chardet
Reviewer: Olivier Tilloy (status: Needs Fixing)
Review via email: mp+25349@code.launchpad.net

Description of the change

This bundle implements three ways of subtitle encoding detection, in
order:
a) user-defined list of encodings
b) chardet for automatic encoding detection
c) i18n-defined list of encodings

The user can define a list of encodings to try in the config file; the
first encoding that successfully loads the file will be used.

On initial install a) won't be used because the default encoding list is
empty. Automatic detection is done by the python-chardet module [1].
Currently chardet is used in try ... except blocks, so it is not a hard
dependency, although installing it is strongly encouraged.

If a) is not used or fails, and b) fails or is less confident of its
findings than 0.9 on a [0, 1] scale, c) is tried: a list of encodings
defined by the translator for the current locale is used, just as in a).
If this fails too, the encoding detected in b) is used.
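The decision order described above can be sketched as follows. All names here are illustrative, not the actual Moovida code; `detect` stands in for an optional chardet-style detector returning an (encoding, confidence) pair:

```python
def pick_subtitle_encoding(data, user_encodings, i18n_encodings,
                           detect=None, threshold=0.9):
    """Sketch of the a) -> b) -> c) fallback chain for subtitle bytes."""
    def first_that_decodes(encodings):
        # Return the first encoding that strictly decodes the data.
        for enc in encodings:
            try:
                data.decode(enc, 'strict')
            except (UnicodeDecodeError, LookupError):
                continue
            return enc
        return None

    # a) user-defined list of encodings from the config file
    encoding = first_that_decodes(user_encodings)
    if encoding:
        return encoding

    # b) automatic detection, trusted outright above the threshold
    detected, confidence = detect(data) if detect else (None, 0.0)
    if detected and confidence >= threshold:
        return detected

    # c) i18n-defined list for the current locale
    encoding = first_that_decodes(i18n_encodings)
    if encoding:
        return encoding

    # last resort: whatever b) found, even at low confidence
    return detected
```

The threshold parameter corresponds to the 0.9 confidence cut-off mentioned above.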

The careful reviewer will see that this bundle does not introduce any
regressions - all subtitles are loaded as usual. The routines
implemented in this bundle can be tested as follows:

* on initial install, without user- or i18n-defined encodings and
without python-chardet installed, subtitles will be loaded with the
default gstreamer locale. Then config-file support should be tried, both
with a correct list of encodings and with one that will fail (e.g.
['ascii']). In both cases the subtitles will load, but will be displayed
correctly only in the first case;
* after updating the translation template (setup.py pot_update) and the
catalog file for your preferred locale (setup.py update_catalog), and
after a subsequent build of the catalogs (setup.py build_po), the two
previous tests for user-defined encodings should be repeated. This time
the failing example should be corrected by i18n support, as long as the
right encoding was added to the language catalog;
* it is now time to install python-chardet (packaged for most major
distros) and run the tests again. Empty the config encodings list and
remove the compiled catalog files (*.mo); the encoding should still be
detected properly and the subtitles displayed correctly. The remaining
issues are mostly with differentiating WINDOWS-1250 from ISO-8859-2
(Central European).

IMPORTANT:
Applying this bundle should be followed by adding python-chardet [1] to
the Windows build.

[1] http://chardet.feedparser.org/

Revision history for this message
Olivier Tilloy (osomon) wrote :

Thanks for this cool patch!

I've played with it and looked at it carefully, and overall it looks very good to me, a nice improvement in subtitles handling.

My functional tests were mainly focused on using chardet to automagically detect the encoding, and I used various subtitle files in the following languages: Arabic, Czech, Danish, French, Icelandic, Russian, Turkish, Vietnamese (of course I'm fluent in all of them, why d'you ask?).

On my setup, chardet does a better job for some languages (Russian for example), but it also introduces a pretty bad regression for French (the file I tested is obviously encoded in ISO-8859, but chardet detects it as windows-1255, with a confidence of 0.99). Consequently, even if I localize the standard encodings string, the wrong encoding will be used (because of the order methods are tried in). The only way to work around this issue is to have "ISO-8859" listed as my default encoding for subtitles in the configuration. The problem is, this value isn't there by default. How about defining a dictionary of well-known default encodings for given locales, and using it to populate the subtitle_encoding setting? E.g.: {'fr_FR': ['ISO-8859']} ...
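A minimal sketch of the locale-keyed defaults suggested above; the dictionary contents and the helper are hypothetical, not existing Moovida configuration (and 'ISO-8859-1' is used as a concrete stand-in for the "ISO-8859" family):

```python
# Hypothetical table of well-known subtitle encodings per locale,
# used only to pre-populate the subtitle_encoding setting.
DEFAULT_SUBTITLE_ENCODINGS = {
    'fr_FR': ['ISO-8859-1'],
    'pl_PL': ['ISO-8859-2', 'windows-1250'],
    'ru_RU': ['koi8-r', 'windows-1251'],
}

def default_encodings_for(locale_name):
    # Unknown locales fall back to an empty list, i.e. current behaviour.
    return DEFAULT_SUBTITLE_ENCODINGS.get(locale_name, [])
```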

I realize it's yet another clunky solution on top of an already complex process to get the subtitles right, but I wouldn't want to see regressions introduced, especially of that kind.
Any criticism is welcome, and I'm happy to discuss/review alternative solutions.

And finally, a tiny remark on the style of the code itself. In the "try... except NameError" block, you enclosed a lot of code. As I understand it, the only statement that may raise a NameError is "chardet = UniversalDetector()", so it would be better to leave the rest out of the "try... except" block (use e.g. an "else" clause).
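The suggested restructuring, sketched on a reduced version of the block (the optional import mirrors the patch; feeding an iterable of lines replaces the file handling for brevity):

```python
# Optional dependency, exactly as in the patch: on ImportError the
# name UniversalDetector is simply left unbound.
try:
    from chardet.universaldetector import UniversalDetector
except ImportError:
    pass

def detect_encoding(lines):
    # Only the constructor call can raise NameError (when the import
    # above failed), so it alone sits in the try; the detection loop
    # moves to the else clause.
    try:
        detector = UniversalDetector()
    except NameError:
        return None, None
    else:
        for line in lines:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding'], detector.result['confidence']
```

With chardet installed this returns the detected (encoding, confidence); without it, (None, None), preserving the soft-dependency behaviour.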

review: Needs Fixing
Revision history for this message
Michał Sawicz (saviq) wrote :

Dnia 2010-05-20, czw o godzinie 18:19 +0000, Olivier Tilloy pisze:
> On my setup, chardet does a better job for some languages (Russian for
> example), but it also introduces a pretty bad regression for French
> (the file I tested is obviously encoded in ISO-8859, but chardet
> detects it as windows-1255, with a confidence of 0.99). Consequently,
> even if I localize the standard encodings string, the wrong encoding
> will be used (because of the order methods are tried in). The only way
> to work around this issue is to have "ISO-8859" listed as my default
> encoding for subtitles in the configuration. The problem is, this
> value isn't there by default. How about defining a dictionary of
> well-known default encodings for given locales, and using it to
> populate the subtitle_encoding setting? E.g.: {'fr_FR':
> ['ISO-8859']} ...
>
> I realize it's yet another clunky solution on top of an already
> complex process to get the subtitles right, but I wouldn't want to see
> regressions introduced, especially of that kind.
> Any criticism is welcome, and I'm happy to discuss/review alternative
> solutions.

That's (almost) exactly what I've done through i18n. There's usually no
way to know what language the subtitles are in; we only know the locale,
and we can assume the subtitles are in that language. Maybe the list
defined by translators should have precedence over chardet?

It's (almost) the same with ISO-8859-2 and CP1250, which are both
charsets for Polish, but IIRC chardet reported a much lower confidence -
hence the 'confidence < 0.9' condition.

I will try and stress test chardet for your case and will report back.
Of course we can't push it with such a regression.

> And finally, a tiny remark on the style of the code itself. In the
> "try... except NameError" block, you enclosed a lot of code. As I
> understand it, the only statement that may raise a NameError is
> "chardet = UniversalDetector()", so it would be better to leave the
> rest out of the "try... except" block (use e.g. an "else" clause).

I think when I wrote this patch I wasn't yet aware of the "else" clause
for "try... except" :).

Thanks for the review, hope we can get it right. I even think this is
something that should be fixed in chardet, will try and get in touch
with them to see what can be done.

--
Michał Sawicz <email address hidden>

Revision history for this message
Olivier Tilloy (osomon) wrote :

On 2010-05-20, Michał Sawicz <email address hidden> wrote:
> Dnia 2010-05-20, czw o godzinie 18:19 +0000, Olivier Tilloy pisze:
>> On my setup, chardet does a better job for some languages (Russian for
>> example), but it also introduces a pretty bad regression for French
>> (the file I tested is obviously encoded in ISO-8859, but chardet
>> detects it as windows-1255, with a confidence of 0.99). Consequently,
>> even if I localize the standard encodings string, the wrong encoding
>> will be used (because of the order methods are tried in). The only way
>> to work around this issue is to have "ISO-8859" listed as my default
>> encoding for subtitles in the configuration. The problem is, this
>> value isn't there by default. How about defining a dictionary of
>> well-known default encodings for given locales, and using it to
>> populate the subtitle_encoding setting? E.g.: {'fr_FR':
>> ['ISO-8859']} ...
>>
>> I realize it's yet another clunky solution on top of an already
>> complex process to get the subtitles right, but I wouldn't want to see
>> regressions introduced, especially of that kind.
>> Any criticism is welcome, and I'm happy to discuss/review alternative
>> solutions.
>
> That's (almost) exactly what I've done through i18n. There's usually no
> way to know what language the subtitles are in, we only know the locale
> and we can assume that subtitles are in this language. Maybe the list
> defined by translators should have precedence over chardet?

That's one possible solution, yes. Needs testing.

> It's (almost) the same with ISO-8859-2 and CP1250, which are both
> charsets for Polish, but IIRC chardet reported a much lower confidence -
> hence the 'confidence < 0.9' condition.

The condition looks good to me, the problem is false positives.

> I will try and stress test chardet for your case and will report back.
> Of course we can't push it with such a regression.
>
>> And finally, a tiny remark on the style of the code itself. In the
>> "try... except NameError" block, you enclosed a lot of code. As I
>> understand it, the only statement that may raise a NameError is
>> "chardet = UniversalDetector()", so it would be better to leave the
>> rest out of the "try... except" block (use e.g. an "else" clause).
>
> I think when I wrote this patch I wasn't yet aware of the "else" clause
> for "try... except" :).
>
> Thanks for the review, hope we can get it right. I even think this is
> something that should be fixed in chardet, will try and get in touch
> with them to see what can be done.

Definitely the way to go, if there is a way to get this fixed in
chardet, it's even better. Note that I can provide the subtitle files I
used to test on demand.

Unmerged revisions

1616. By Michał Sawicz

Add automatic subtitle encoding detection

Preview Diff

=== modified file 'elisa-core/README'
--- elisa-core/README	2009-11-09 15:44:51 +0000
+++ elisa-core/README	2010-05-14 17:46:28 +0000
@@ -32,6 +32,7 @@
 - python-pgm 0.3.12 https://core.fluendo.com/pigment/trac
 - python-simplejson
 - python-cssutils
+- python-chardet

 * for the development version of moovida, you should always use
   pigment-svn/python-pigment-svn

=== modified file 'elisa-plugins/elisa/plugins/poblesec/player_video.py'
--- elisa-plugins/elisa/plugins/poblesec/player_video.py	2010-01-28 17:18:54 +0000
+++ elisa-plugins/elisa/plugins/poblesec/player_video.py	2010-05-14 17:46:28 +0000
@@ -54,6 +54,11 @@

 import platform
 from urllib import quote
+# try loading chardet to detect subtitle encoding
+try:
+    from chardet.universaldetector import UniversalDetector
+except ImportError:
+    pass
 try:
     from elisa.plugins.database.models import File as DBFile
 except ImportError:
@@ -711,6 +716,60 @@
             quoted_sub_uri = quote(sub_uri, ':/')
             self.pipeline.set_property('suburi', quoted_sub_uri)
             self.info("Loaded subtitles at %r", quoted_sub_uri)
+
+            def try_encodings(encoding_list, file_path):
+                for encoding in encoding_list:
+                    try:
+                        file = open(file_path)
+                        unicode(file.read(), encoding, 'strict')
+                    except UnicodeDecodeError:
+                        self.debug('Failed encoding %s', encoding)
+                        continue
+                    finally:
+                        file.close()
+                    return encoding
+                return None
+
+            encoding = None
+            confidence = None
+
+            # try custom encodings list
+            if self.config.get('subtitle_encoding'):
+                encoding = \
+                    try_encodings(self.config.get('subtitle_encoding'), \
+                                  sub_file)
+            if not encoding:
+                # try using chardet for subtitle charset detection
+                try:
+                    file = open(sub_file)
+                    chardet = UniversalDetector()
+                    for line in file:
+                        chardet.feed(line)
+                        if chardet.done:
+                            break
+                    chardet.close()
+                    if chardet.result['encoding']:
+                        encoding = chardet.result['encoding']
+                        confidence = chardet.result['confidence']
+                        self.info("Detected encoding %s with %f confidence", \
+                                  chardet.result['encoding'], \
+                                  chardet.result['confidence'])
+                except NameError:
+                    pass
+                finally:
+                    file.close()
+            # try encoding list defined by translators
+            if not encoding or (confidence is not None and confidence < 0.9):
+                # NOTE: a space separated list of possible subtitle
+                # encodings for your language
+                i18n_encoding_list = _('ascii utf8').split(' ')
+                i18n_encoding = \
+                    try_encodings(i18n_encoding_list, sub_file)
+                if i18n_encoding:
+                    encoding = i18n_encoding
+            if encoding:
+                self.pipeline.set_property('subtitle-encoding', encoding)
+                self.info("Using encoding %s", encoding)
             found = True
             break
         if not found:
@@ -1690,13 +1749,18 @@
     default_config = {'audio_sink' : get_default_audio_sink(),
                       'subtitle_font' : 'Liberation Sans',
                       'subtitle_font_size_factor': '0.0612',
+                      'subtitle_encoding' : []
                       }

-    config_doc = {'subtitle_font' : 'Font name used for subtitles',
+    config_doc = {'audio_sink' : 'give the name of the gstreamer audio sink ' \
+                                 'elisa should use. If not set autoaudiosink '\
+                                 'will be used.',
+                  'subtitle_font' : 'Font name used for subtitles',
                   'subtitle_font_size_factor': 'Factor multiplied by video '\
                                                'height to obtain subtitle '\
                                                'font size',
-                  'audio_sink' : 'give the name of the gstreamer audio sink' \
-                                 'elisa should use. If not set autoaudiosink'\
-                                 'will be used.',
+                  'subtitle_encoding' : 'A list of character encodings to ' \
+                                        'try in order. If not set, automatic ' \
+                                        'detection and i18n-defined encodings ' \
+                                        'will be tried'
                   }

=== modified file 'elisa-plugins/elisa/plugins/poblesec/setup.py'
--- elisa-plugins/elisa/plugins/poblesec/setup.py	2010-01-29 11:05:05 +0000
+++ elisa-plugins/elisa/plugins/poblesec/setup.py	2010-05-14 17:46:28 +0000
@@ -12,6 +12,7 @@
     license='GPL3',
     author='Moovida Developers',
     author_email='elisa@lists.fluendo.com',
+    install_requires=['chardet'],
     namespace_packages=['elisa', 'elisa.plugins'],
     packages=packages,
     package_dir=package_dir,
