calibre

Merge lp:~ldolse/calibre/heuristics into lp:calibre

heuristics
Merge into trunk

Proposed by Lee on 2012-04-20

Status:	Merged
Merged at revision:	11963
Proposed branch:	lp:~ldolse/calibre/heuristics
Merge into:	lp:calibre
Diff against target:	37 lines (+11/-3), 2 files modified
	src/calibre/ebooks/conversion/preprocess.py (+1/-1)
	src/calibre/ebooks/conversion/utils.py (+10/-2)
To merge this branch:	bzr merge lp:~ldolse/calibre/heuristics

Reviewer	Review Type	Date Requested	Status
Kovid Goyal		2012-04-20	Pending
Review via email: mp+102903@code.launchpad.net

Description of the change

Discovered one other change in the European character patch that would cause false-positive line unwrapping. Updated the pattern to revert to the original behavior, which favors false negatives.
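
To illustrate the trade-off described above, here is a minimal sketch and not calibre's actual code. The pattern is a heavily simplified stand-in for the one in the diff, and `MIN_LENGTH`, `make_unwrap_pattern`, `unwrap` and the sample strings are all hypothetical names introduced for this example. It shows how allowing curly quotes before a paragraph break can join two separate paragraphs.

```python
import re

MIN_LENGTH = 10  # hypothetical threshold standing in for `length` in the diff

CONTINUATION = 'a-z,:'          # characters suggesting the sentence continues
CURLY_QUOTES = '\u201c\u201d'   # the curly double quotes this change removes


def make_unwrap_pattern(char_class):
    """Match a paragraph break that follows a long line ending in char_class."""
    return re.compile(
        r'(?<=.{%d}[%s])\s*</p>\s*<p>\s*(?=\w)' % (MIN_LENGTH, char_class)
    )


def unwrap(html, char_class):
    """Join paragraphs that the pattern treats as one hard-wrapped line."""
    return make_unwrap_pattern(char_class).sub(' ', html)


# A genuinely wrapped line: it ends in a comma, so joining is correct.
wrapped = '<p>the quick brown fox jumped over,</p>\n<p>the lazy dog</p>'

# Two real paragraphs: the first ends with a closing curly quote.
quoted = '<p>\u201cI will go there tomorrow.\u201d</p>\n<p>She left at once.</p>'

print(unwrap(wrapped, CONTINUATION))
print(unwrap(quoted, CONTINUATION))                 # left alone
print(unwrap(quoted, CONTINUATION + CURLY_QUOTES))  # false positive: joined
```

With the quotes excluded, a wrapped line that happens to end in a closing quote stays wrapped (a false negative), which is easier to spot in a manual review than two paragraphs silently merged.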

Preview Diff

=== modified file 'src/calibre/ebooks/conversion/preprocess.py'
--- src/calibre/ebooks/conversion/preprocess.py 2012-04-13 15:23:43 +0000
+++ src/calibre/ebooks/conversion/preprocess.py 2012-04-20 17:04:32 +0000
@@ -559,7 +559,7 @@
end_rules.append((re.compile(u'(?<=.{%i}[–—])\s<p>\s(?=[[a-z\d])' % length), lambda match: ''))
end_rules.append(
# Un wrap using punctuation
- (re.compile(u'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:“”)\IA\u00DF]|(?<!\&\w{4});))\s(?P<ital></(i|b|u)>)?\s(</p>\s<p>\s)+\s(?=(<(i|b|u)>)?\s[\w\d$(])' % length, re.UNICODE), wrap_lines),
+ (re.compile(u'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))\s(?P<ital></(i|b|u)>)?\s(</p>\s<p>\s)+\s(?=(<(i|b|u)>)?\s[\w\d$(])' % length, re.UNICODE), wrap_lines),
)

for rule in self.PREPROCESS + start_rules:

=== modified file 'src/calibre/ebooks/conversion/utils.py'
--- src/calibre/ebooks/conversion/utils.py 2012-04-20 13:52:57 +0000
+++ src/calibre/ebooks/conversion/utils.py 2012-04-20 17:04:32 +0000
@@ -316,10 +316,18 @@
'''
Unwraps lines based on line length and punctuation
supports a range of html markup and text files
+
+ the lookahead regex below is meant look for any non-full stop characters - punctuation
+ characters which can be used as a full stop should not be added below - e.g. ?!“”. etc
+ the reason for this is to prevent false positive wrapping. False positives are more
+ difficult to detect than false negatives during a manual review of the doc
+
+ This function intentionally leaves hyphenated content alone as that is handled by the
+ dehyphenate routine in a separate step
'''
+
# define the pieces of the regex
-
- lookahead = "(?<=.{"+str(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:“”)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity
+ lookahead = "(?<=.{"+str(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity
em_en_lookahead = "(?<=.{"+str(length)+u"}[\u2013\u2014])"
soft_hyphen = u"\xad"
line_ending = "\s</(span|[iubp]|div)>\s(</(span|[iubp]|div)>)?"
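
The inline comment in the diff describes `(?<!\&\w{4});` as "a semicolon not part of an entity". The sketch below isolates that one fragment to show what it accepts and rejects. The helper name `plain_semicolons` and the sample strings are hypothetical and not part of calibre.

```python
import re

# The fragment from the diff, on its own: a semicolon that is not
# immediately preceded by '&' plus exactly four word characters.
NON_ENTITY_SEMICOLON = re.compile(r'(?<!&\w{4});')


def plain_semicolons(text):
    """Return the indexes of semicolons the fragment treats as punctuation."""
    return [m.start() for m in NON_ENTITY_SEMICOLON.finditer(text)]


# '&nbsp;' has a four-letter name, so its semicolon is skipped;
# the semicolon after 'b' is ordinary punctuation and is reported.
print(plain_semicolons('a&nbsp;b; c'))

# The lookbehind is fixed-width, so an entity with a shorter name
# such as '&amp;' is not recognised and its semicolon is reported.
print(plain_semicolons('x&amp;'))
```

Python's `re` module requires lookbehind assertions to be fixed-width, which is presumably why the fragment checks for exactly four characters instead of a variable-length entity name.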
