Merge lp:~ldolse/calibre/heuristics into lp:calibre

Proposed by Lee
Status: Merged
Merged at revision: 11963
Proposed branch: lp:~ldolse/calibre/heuristics
Merge into: lp:calibre
Diff against target: 37 lines (+11/-3)
2 files modified
src/calibre/ebooks/conversion/preprocess.py (+1/-1)
src/calibre/ebooks/conversion/utils.py (+10/-2)
To merge this branch: bzr merge lp:~ldolse/calibre/heuristics
Reviewer       Review Type    Date Requested    Status
Kovid Goyal                                     Pending
Review via email: mp+102903@code.launchpad.net

Description of the change

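Discovered one other change in the European character patch that would cause false-positive line unwrapping: the curly quotes “” had been added to the character class in the unwrap pattern. Updated the pattern to revert to the original behavior, which favors false negatives over false positives.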
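
The trade-off described above can be illustrated with a minimal sketch. It is not calibre's code: `build_unwrap_pattern`, `unwrap`, and `LENGTH` are hypothetical names, the character class is cut down to a few ASCII characters, and the replacement is a plain space rather than calibre's `wrap_lines` callback. It only shows how allowing curly quotes before a paragraph break can join paragraphs that should stay separate.

```python
import re

LENGTH = 10  # hypothetical minimum line length; calibre derives this from the document


def build_unwrap_pattern(extra_chars=""):
    # Simplified stand-in for the patched pattern: a paragraph break is a
    # candidate for removal only if it follows at least LENGTH characters
    # and a character that cannot end a sentence.
    lookbehind = r"(?<=.{%d}[a-z,:)%s])" % (LENGTH, extra_chars)
    return re.compile(lookbehind + r"\s*</p>\s*<p>\s*(?=[\w$(])", re.UNICODE)


def unwrap(html, pattern):
    # Join the two paragraphs with a single space wherever the pattern matches.
    return pattern.sub(" ", html)


strict = build_unwrap_pattern()      # behavior after this merge (no curly quotes)
loose = build_unwrap_pattern("“”")   # behavior before this merge

mid_sentence = "<p>the quick brown fox jumped over</p><p>the lazy dog.</p>"
dialogue = "<p>“I will not go,” she said. “Never.”</p><p>He nodded and left.</p>"

print(unwrap(mid_sentence, strict))  # joined: the first line ends in a letter
print(unwrap(dialogue, strict))      # unchanged: the closing quote blocks the match
print(unwrap(dialogue, loose))       # joined: a false positive
```

With the stricter class, a wrapped line that happens to end in a closing quote stays split (a false negative), which is easier to spot in a manual review than two paragraphs that were silently merged.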


Preview Diff

=== modified file 'src/calibre/ebooks/conversion/preprocess.py'
--- src/calibre/ebooks/conversion/preprocess.py 2012-04-13 15:23:43 +0000
+++ src/calibre/ebooks/conversion/preprocess.py 2012-04-20 17:04:32 +0000
@@ -559,7 +559,7 @@
 end_rules.append((re.compile(u'(?<=.{%i}[–—])\s*<p>\s*(?=[[a-z\d])' % length), lambda match: ''))
 end_rules.append(
 # Un wrap using punctuation
-(re.compile(u'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:“”)\IA\u00DF]|(?<!\&\w{4});))\s*(?P<ital></(i|b|u)>)?\s*(</p>\s*<p>\s*)+\s*(?=(<(i|b|u)>)?\s*[\w\d$(])' % length, re.UNICODE), wrap_lines),
+(re.compile(u'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))\s*(?P<ital></(i|b|u)>)?\s*(</p>\s*<p>\s*)+\s*(?=(<(i|b|u)>)?\s*[\w\d$(])' % length, re.UNICODE), wrap_lines),
 )
 
 for rule in self.PREPROCESS + start_rules:

=== modified file 'src/calibre/ebooks/conversion/utils.py'
--- src/calibre/ebooks/conversion/utils.py 2012-04-20 13:52:57 +0000
+++ src/calibre/ebooks/conversion/utils.py 2012-04-20 17:04:32 +0000
@@ -316,10 +316,18 @@
 '''
 Unwraps lines based on line length and punctuation
 supports a range of html markup and text files
+
+the lookahead regex below is meant look for any non-full stop characters - punctuation
+characters which can be used as a full stop should *not* be added below - e.g. ?!“”. etc
+the reason for this is to prevent false positive wrapping. False positives are more
+difficult to detect than false negatives during a manual review of the doc
+
+This function intentionally leaves hyphenated content alone as that is handled by the
+dehyphenate routine in a separate step
 '''
+
 # define the pieces of the regex
-
-lookahead = "(?<=.{"+str(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:“”)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity
+lookahead = "(?<=.{"+str(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity
 em_en_lookahead = "(?<=.{"+str(length)+u"}[\u2013\u2014])"
 soft_hyphen = u"\xad"
 line_ending = "\s*</(span|[iubp]|div)>\s*(</(span|[iubp]|div)>)?"
