Merge lp:~ldolse/calibre/heuristics into lp:calibre

Proposed by Lee
Status: Merged
Merged at revision: 11963
Proposed branch: lp:~ldolse/calibre/heuristics
Merge into: lp:calibre
Diff against target: 37 lines (+11/-3)
2 files modified
src/calibre/ebooks/conversion/preprocess.py (+1/-1)
src/calibre/ebooks/conversion/utils.py (+10/-2)
To merge this branch: bzr merge lp:~ldolse/calibre/heuristics
Reviewer: Kovid Goyal (status: Pending)
Review via email: mp+102903@code.launchpad.net

Description of the change

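Found one more change in the European character patch that would cause false-positive line unwrapping. Updated the pattern to restore the original behavior, which favors false negatives: the curly quotes “ and ” are removed from the punctuation character class in both preprocess.py and utils.py.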
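To illustrate the false positive described above, here is a minimal sketch using a deliberately simplified version of the pattern. The function names `build_unwrap_pattern` and `unwrap`, the shortened character class, and the plain-space replacement are assumptions made for illustration; they are not calibre's actual code, which uses a much larger character class and a `wrap_lines` callback.

```python
import re

def build_unwrap_pattern(length, allow_curly_quotes):
    # Simplified stand-in for the character class in the patch:
    # characters that usually do NOT end a sentence.
    chars = "a-z,:)"
    if allow_curly_quotes:
        # The behavior this merge reverts: treating curly quotes as
        # "line continues" characters.
        chars += "\u201c\u201d"
    return re.compile(
        r"(?<=.{%d}[%s])\s*(</p>\s*<p>\s*)+\s*(?=[\w$(])" % (length, chars),
        re.UNICODE,
    )

def unwrap(html, pattern):
    # Assumption: join the two paragraphs with a single space.
    return pattern.sub(" ", html)

old_pattern = build_unwrap_pattern(10, allow_curly_quotes=True)
new_pattern = build_unwrap_pattern(10, allow_curly_quotes=False)

# A paragraph of dialogue that genuinely ends at the closing quote.
dialogue = "<p>\u201cI am not going back there.\u201d</p>\n<p>He turned away.</p>"
# A hard-wrapped line that should be joined with the next one.
wrapped = "<p>the rain kept falling on the</p>\n<p>empty street.</p>"

print(unwrap(dialogue, old_pattern))  # paragraphs wrongly merged
print(unwrap(dialogue, new_pattern))  # paragraphs left alone
print(unwrap(wrapped, new_pattern))   # wrapped line still joined
```

With the curly quotes in the class, the dialogue paragraph is merged into the following one, which is hard to spot in a manual review. Without them, a wrapped line that happens to end in a closing quote is missed instead, which is the easier error to notice.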


Preview Diff

=== modified file 'src/calibre/ebooks/conversion/preprocess.py'
--- src/calibre/ebooks/conversion/preprocess.py 2012-04-13 15:23:43 +0000
+++ src/calibre/ebooks/conversion/preprocess.py 2012-04-20 17:04:32 +0000
@@ -559,7 +559,7 @@
 end_rules.append((re.compile(u'(?<=.{%i}[–—])\s*<p>\s*(?=[[a-z\d])' % length), lambda match: ''))
 end_rules.append(
 # Un wrap using punctuation
- (re.compile(u'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:“”)\IA\u00DF]|(?<!\&\w{4});))\s*(?P<ital></(i|b|u)>)?\s*(</p>\s*<p>\s*)+\s*(?=(<(i|b|u)>)?\s*[\w\d$(])' % length, re.UNICODE), wrap_lines),
+ (re.compile(u'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))\s*(?P<ital></(i|b|u)>)?\s*(</p>\s*<p>\s*)+\s*(?=(<(i|b|u)>)?\s*[\w\d$(])' % length, re.UNICODE), wrap_lines),
 )

 for rule in self.PREPROCESS + start_rules:

=== modified file 'src/calibre/ebooks/conversion/utils.py'
--- src/calibre/ebooks/conversion/utils.py 2012-04-20 13:52:57 +0000
+++ src/calibre/ebooks/conversion/utils.py 2012-04-20 17:04:32 +0000
@@ -316,10 +316,18 @@
 '''
 Unwraps lines based on line length and punctuation
 supports a range of html markup and text files
+
+ the lookahead regex below is meant look for any non-full stop characters - punctuation
+ characters which can be used as a full stop should *not* be added below - e.g. ?!“”. etc
+ the reason for this is to prevent false positive wrapping. False positives are more
+ difficult to detect than false negatives during a manual review of the doc
+
+ This function intentionally leaves hyphenated content alone as that is handled by the
+ dehyphenate routine in a separate step
 '''
+
 # define the pieces of the regex
-
- lookahead = "(?<=.{"+str(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:“”)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity
+ lookahead = "(?<=.{"+str(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity
 em_en_lookahead = "(?<=.{"+str(length)+u"}[\u2013\u2014])"
 soft_hyphen = u"\xad"
 line_ending = "\s*</(span|[iubp]|div)>\s*(</(span|[iubp]|div)>)?"
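
The inline comment in the diff describes `(?<!\&\w{4});` as a semicolon that is not part of an entity. The sketch below isolates that fragment to show how the negative lookbehind behaves. The names `PLAIN_SEMICOLON` and `has_plain_semicolon` are hypothetical, and the ampersand is written unescaped, which is equivalent in Python's `re` module.

```python
import re

# Fragment taken from the diff; the constant name is hypothetical.
# The lookbehind rejects a ';' preceded by '&' plus exactly four word characters.
PLAIN_SEMICOLON = re.compile(r"(?<!&\w{4});")

def has_plain_semicolon(text):
    # True if the text contains a ';' that the pattern does not
    # consider part of an HTML entity.
    return PLAIN_SEMICOLON.search(text) is not None

print(has_plain_semicolon("first clause;"))   # ordinary semicolon
print(has_plain_semicolon("a&nbsp;"))         # four-letter entity name
print(has_plain_semicolon("a &amp;"))         # three-letter entity name
```

Because the lookbehind has a fixed width, only entities with four-character names such as `&nbsp;` or `&quot;` are excluded. The semicolon of a shorter or longer entity such as `&amp;` or `&mdash;` is still treated as a plain semicolon by this fragment.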
