Merge lp:~grishkin/chm2pdf/chm2pdf_branch into lp:~reto-knaak/chm2pdf/chm2pdf_branch

Proposed by Max Grishkin
Status: Needs review
Proposed branch: lp:~grishkin/chm2pdf/chm2pdf_branch
Merge into: lp:~reto-knaak/chm2pdf/chm2pdf_branch
Diff against target: 564 lines (+160/-94)
1 file modified
chm2pdf (+160/-94)
To merge this branch: bzr merge lp:~grishkin/chm2pdf/chm2pdf_branch
Reviewer: Reto Knaak (review status: Pending)
Review via email: mp+128385@code.launchpad.net

Description of the change

Reto Knaak (reto-knaak) wrote :

Hi Grishkin (Max?) !

Thank you for the files... I'm not a real programmer (I just tried to fix
some issues that were preventing me from using the script), so I don't know if I
am the right person to do the code review.
It has been many months since I last booted my VirtualBox with Ubuntu, and
I have the feeling I forgot most of what I learned while trying to fix the script.

Anyway, this evening I had some time and began to download the files - just
to see what's going on.

My operating system is Win7, and if I open the CHM file from Windows it
won't open, probably due to the missing TOC!
Then I tried to import it into calibre, and if I open the CHM there
something is displayed, but it is the "Liberty Bay" article from the online(!)
Wikipedia.
If I convert the file to some other format (mobi), I get a page with "1951
Chicago Bears season", which seems to me the right output.
So I'm not sure the demo CHM file has a valid output, but I agree that it's
a good idea to try to extract what's there.

I'm not familiar with the code review process, and I am asking myself if
I/we should open a bug under Ubuntu (that is what I did with the bugs I
found previously)?

I took a quick look at the diff, and most of the differences are not
real ones (probably just whitespace); the only true differences are
in def get_html_list(cfile) and def get_objective_urls_list(filename).

For the first one, it's the first time I have seen "lambda" (so again, I'm
probably not the right one to review...).
I think I understand what it's meant for, but I can't say I understand how
it works... (if the first way of retrieving the HTML files fails, use the second
one, which takes all the files found, or something similar?)
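
The fallback asked about here can be sketched in a few lines. This is only an illustration written for Python 3 (the script itself is Python 2); list_files and pick_html_pages are hypothetical names, not functions from chm2pdf. The lambda is just an unnamed one-argument function used as a filter condition, and the default lambda accepts every file.

```python
def list_files(all_files, cond=lambda name: True):
    """Return the entries of all_files for which cond(name) is true.

    The default lambda accepts everything, so callers that pass no
    condition get the full list back.
    """
    return [name for name in all_files if cond(name)]


def pick_html_pages(toc_pages, all_files):
    """Prefer the table of contents; fall back to every HTML file found."""
    if toc_pages is not None:
        return toc_pages
    # str.endswith accepts a tuple of suffixes.
    return list_files(all_files, lambda s: s.endswith(('.htm', '.html')))


# Hypothetical archive listing, for illustration only.
files = ['/index.html', '/img/logo.gif', '/doc space/page.htm']
print(pick_html_pages(None, files))             # no TOC: filter by suffix
print(pick_html_pages(['/index.html'], files))  # TOC present: use it as-is
```

So the reading in the comment above is essentially right: when the table of contents is missing, every file is listed and only those passing the condition are kept.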

For the second one, "my local chm2pdf" is like this:

def get_objective_urls_list(filename):
    '''
    takes the list of files inside the chm archive, with the correct urls of each one.
    '''
    os.system('enum_chmLib '+filename+' > "'+CHM2PDF_WORK_DIR+'/urlslist.txt"')
    flist=open(CHM2PDF_WORK_DIR+'/urlslist.txt','rU')
    urls_list=[]
    for line in flist.readlines()[3:]:
        #print 'line',line
        #This won't work if internal paths of CHM contains spaces: e.g. /doc space/ will only become /doc
        #spline=line.split()
        #urls_list.append(spline[5])
        #this should work better:
        spline= re.sub(r".*?normal file\s*(.*?)\n$", "\\1", line)
        if spline[0]=="/":
          #print "got spline="+spline
          urls_list.append( spline)
    flist.close()
    # os.remove(CHM2PDF_WORK_DIR+'/urlslist.txt')
    return urls_list

Does your solution work with CHM paths containing spaces? (If you need a
sample file, see
https://bugs.launchpad.net/ubuntu/+source/chm2pdf/+bug/894193 )
I have the feeling (I haven't actually run any scripts this evening, and I am
forgetting Python) that using urls_list.append(spline[5]) will fail for paths
with spaces!
I also have the feeling that my solution is not really state of the art, so
maybe you can suggest something that solves both problems?
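
The difference between the two approaches can be checked on a single listing line. The sample line below is an assumption modelled on the column layout the script expects from enum_chmLib (three numeric columns, the type, then the path); it was not captured from a real run.

```python
import re

# Assumed sample line in the layout the script parses; not real tool output.
line = "      1   123456     2048   normal file\t\t/doc space/page one.html\n"

# Whitespace splitting: the path is cut at its first space.
by_split = line.split()[5]

# Regex from the patch: keep everything after "normal file" up to the newline.
by_regex = re.sub(r".*?normal file\s*(.*?)\n$", "\\1", line)

print(by_split)   # /doc
print(by_regex)   # /doc space/page one.html
```

On this input split()[5] returns only the part of the path before the first space, while the regular expression keeps the whole path.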

Hope to hear from you soon, and kind regards from the Italian part of Switzerland!

Ciao
Reto Knaak


lp:~grishkin/chm2pdf/chm2pdf_branch updated
14. By Grishkin <grishkin@mint>

Merged Reto's pathes, fixed problems with spaces

Max Grishkin (grishkin) wrote :

Hello, Reto, and thanks for the constructive response!

I had been watching the activity in the chm2pdf Google group and on Launchpad
for a while, and I understand that you are an ordinary user of chm2pdf, not a
maintainer or author of the software. But chm2pdf was published quite a long
time ago and there have been no bug fixes or improvements since, so the project
may be considered abandoned. Some time ago you were the most active participant
in the project discussion, and you own a branch of it on LP.
When I found this branch, I decided to upload my chm2pdf version to
Launchpad too, just to make it public. My point is that our branches
should be synchronized, and both should have the latest chm2pdf version with
all available fixes.

Concerning the CHM file from the Google Group topic: it is certainly very
dirty, and someone may argue that chm2pdf should not be expected to process
such files correctly. But it was created just to demonstrate the type of file
on which chm2pdf failed before and for which it now generates PDFs. I
just put two completely random Wikipedia articles into one CHM file; they
do not even link to each other, which is why some software shows one page and
other software shows a completely different one. Anyway, the resulting PDF
contains both pages.

I downloaded your patch for spaces in names, but I could not work out which
version to apply it to: the one from the distribution, the one from
code.google.com, or the one from your branch. I tried to merge it into different
versions by hand, but the resulting files still failed on the CHM file from
your Demo_CMH.zip. So I just merged your patch into my branch and fixed
the rest so that conversion started to work for me, mainly by adding quotes
around filenames where needed. It seems that the no-table-of-contents and
spaces-in-filenames fixes work together perfectly! I did not try to solve the
problem with '%20' sequences, but I'll think about it shortly; it does
not seem to be a difficult problem.
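
As a side note on "adding quotes around filenames": wrapping a name in double quotes handles spaces, but not names that themselves contain quotes or shell metacharacters. The sketch below is written for Python 3 and is not part of the branch; quote_naive is a hypothetical stand-in for the quote() helper in the diff. It compares that approach with shlex.quote (pipes.quote is the Python 2 counterpart) and with passing an argument list, which bypasses the shell altogether.

```python
import shlex
import subprocess
import sys


def quote_naive(s):
    # Same idea as the quote() helper in the diff: wrap in double quotes.
    return '"' + s + '"'


# Hypothetical awkward file name, for illustration only.
name = 'my "demo" file $HOME.chm'

print(quote_naive(name))   # embedded quotes and $ are still live in a shell
print(shlex.quote(name))   # single-quoted form, safe for a POSIX shell

# Passing a list of arguments avoids shell parsing completely.
# The child process here simply echoes back the argument it received.
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.argv[1])', name],
    capture_output=True, text=True)
print(result.stdout.strip() == name)
```

For a script that calls enum_chmLib, extract_chmLib and htmldoc through os.system, either option would be a possible hardening step; which one fits best is a judgment call for whoever maintains the code.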

Reto, please also note that since you reply through @code.launchpad.net,
your reply will be publicly available at
https://code.launchpad.net/~grishkin/chm2pdf/chm2pdf_branch/+merge/128385.
That's definitely not a problem; I am just pointing this out in case you have
not noticed.
-----
Best regards,
Grishkin Maxim


Reto Knaak (reto-knaak) wrote :

Hi Max!

You're welcome and thank you!
It's true that the project is abandoned... a pity for me, because it does a
wonderful job for me!
Thank you for pointing out that my replies are public; this is OK for me
(I guessed it).

I think it's a good idea to share your patch, and I also suggest filing the
bug officially (with a link to the Google Code bug page).
I also agree to merge the branches, but I'll need some help...

In the meantime I updated my Ubuntu system and tried your CHM example file,
and as expected an error occurred....

I started from the Ubuntu version and made my branch using this sequence:

patch chm2pdf < ../patches/chm2pdf_check_soup.diff
patch chm2pdf < ../patches/chm2pdf_no_javascript.diff
patch chm2pdf < ../patches/chm2pdf_multiple_page_problem.diff
patch chm2pdf < ../patches/chm2pdf_color_removed.diff
patch chm2pdf < ../patches/chm2pdf_links_case_insensitive.diff
patch chm2pdf < ../patches/chm2pdf_images_case_insensitive.diff
patch chm2pdf < ../patches/chm2pdf_specialchars.diff

You probably need to apply all the patches.

Maybe it's also a good idea if you take a critical look at my patches;
chm2pdf is my first and only experience with Python (and Linux), and some of
my solutions may not be too clean...

Kind regards
Reto

lp:~grishkin/chm2pdf/chm2pdf_branch updated
15. By Grishkin <grishkin@mint>

merged several other Reto's patches

16. By Grishkin <grishkin@mint>

Fixed processing of links with spaces

Max Grishkin (grishkin) wrote :

Hi Reto!
I've reviewed your branch at
https://code.launchpad.net/~reto-knaak/chm2pdf/chm2pdf_branch and it seems
that it does not contain any of the patches you mentioned, just changes in the
changelog file. So I've merged these patches into my branch manually. I have
not tested them thoroughly yet, and I am not especially confident in the
correctness of the merge of the last two patches.
It seems that I've found the solution to the problem you described at
https://bugs.launchpad.net/ubuntu/+source/chm2pdf/+bug/894193 - check the
latest comment there.
To merge the branches you'll have to install bzr on your local PC
and download your branch (if it is not already present there) by running:
bzr branch lp:~reto-knaak/chm2pdf/chm2pdf_branch
After that you merge the branches by running:
bzr merge lp:~grishkin/chm2pdf/chm2pdf_branch
And upload the results back to Launchpad:
bzr push lp:~reto-knaak/chm2pdf/chm2pdf_branch

Please feel free to contact me if you have any questions or if something is
not working.

-----
Best regards,
Grishkin Maxim

Reto Knaak (reto-knaak) wrote :

Hi Grishkin!

Nice to hear from you, and thank you for your work!
It's possible that I made some mistakes. I also saw that the patches are not
there, but assumed (as they are listed in /debian/patches/series) that they
would be applied at installation.

I started my VirtualBox and executed the steps you described.
Unfortunately, the merge step gives me
bzr: ERROR: Not a branch: "/home/reto/".
So I'm stuck again...
Nevertheless, I checked your code against mine, and there are a few points to
discuss.

- In your version you don't use temporary directories (like the original
script version on Google Code); in the Ubuntu/Debian version temporary
directories were inserted for security reasons (and against the will of the
developers, with the result that they left the project). As this is a branch
on Ubuntu, it would probably be good to stay with the temporary directories
solution.
Personally I like --dontextract because it's useful in debugging, and
this is broken with temporary directories...

- Thank you for the suggestion for solving the %20 issue.
My suggestion is to use your solution only if --beautifulsoup is used, and
if not, stay with the current solution (it's only a minor problem for me).

- I wonder whether line 522 should be commented out, as its case is now
covered by line 526:
  522: page = re.sub('(?i)"'+match_string, '"'+replace_string, page)
  526: page = re.sub(r'(?i)("|"[^\/"].*?\/)'+match_string, '"'+replace_string, page)

- What is the difference between os.mkdir and os.makedirs? Am I right that
os.makedirs is safer to use?
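
On the last question: os.mkdir creates exactly one directory and fails if its parent is missing, while os.makedirs also creates any missing parent directories. A minimal Python 3 sketch follows; the exist_ok argument does not exist in the Python 2 used by the script, and the directory names are only illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, 'chm2pdf', 'work')

    try:
        os.mkdir(nested)            # parent 'chm2pdf' does not exist yet
        mkdir_ok = True
    except OSError:
        mkdir_ok = False

    os.makedirs(nested, exist_ok=True)   # creates 'chm2pdf' and 'work'
    os.makedirs(nested, exist_ok=True)   # second call is a no-op
    created = os.path.isdir(nested)

print(mkdir_ok, created)   # False True
```

This is why a nested path like /tmp/chm2pdf/work cannot be created with a single os.mkdir call on a fresh system. Note that os.makedirs is more convenient rather than "safer" in the security sense; for the temporary-directory concern in the first point, tempfile.mkdtemp (which creates a uniquely named directory accessible only to the current user) would be the usual standard-library tool.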

I still hope that someone will take charge of maintaining the chm2pdf
project, so that our shared efforts are not lost....
Kind regards!

Reto
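
The question about lines 522 and 526 can be tried out on a small page. This is a sketch under assumptions: the page content and the replacement name temp0001.html are made up, and line_522 and line_526 are hypothetical wrappers around the two substitutions quoted above.

```python
import re

match_string = r'this\.htm'        # escaped original file name
replace_string = 'temp0001.html'   # hypothetical replacement name

# Made-up page: a plain link, a link with a path, and a look-alike name.
page = ('<a href="this.htm">a</a>\n'
        '<a href="sub/this.htm">b</a>\n'
        '<a href="do_this.htm">c</a>')


def line_522(text):
    return re.sub('(?i)"' + match_string, '"' + replace_string, text)


def line_526(text):
    return re.sub(r'(?i)("|"[^\/"].*?\/)' + match_string,
                  '"' + replace_string, text)


both = line_526(line_522(page))   # what the branch currently does
only_526 = line_526(page)         # line 522 commented out

print(both == only_526)
print(only_526)
```

On this input the two variants give the same result, because the first alternative inside the group of line 526 (a bare quote) already covers the case handled by line 522. One caveat: the lazy `.*?` in line 526 is not stopped by a closing quote, so on a line holding several attributes it can match from an earlier quote up to a later "/this.htm" and swallow the markup in between.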

Unmerged revisions

16. By Grishkin <grishkin@mint>

Fixed processing of links with spaces

15. By Grishkin <grishkin@mint>

merged several other Reto's patches

14. By Grishkin <grishkin@mint>

Merged Reto's pathes, fixed problems with spaces

13. By Grishkin <grishkin@mint>

Fixed processing of chm's without table of contents and fixed temp directory creation

Preview Diff

1=== modified file 'LICENSE' (properties changed: -x to +x)
2=== modified file 'PKG-INFO' (properties changed: -x to +x)
3=== modified file 'README' (properties changed: -x to +x)
4=== modified file 'chm2pdf' (properties changed: -x to +x)
5--- chm2pdf 2008-08-05 19:39:01 +0000
6+++ chm2pdf 2012-11-04 13:57:19 +0000
7@@ -28,6 +28,7 @@
8 import re, glob
9 import getopt
10 # from BeautifulSoup import BeautifulSoup
11+import urllib
12
13 global version
14
15@@ -39,13 +40,30 @@
16 global filename #the input filename
17
18 version = '0.9.1'
19-CHM2PDF_TEMP_WORK_DIR='/tmp/chm2pdf/work'
20+CHM2PDF_TEMP_WORK_DIR='/tmp/chm2pdf/work'
21 CHM2PDF_TEMP_ORIG_DIR='/tmp/chm2pdf/orig'
22
23
24
25 # YOU DON'T NEED TO CHANGE ANYTHING BELOW THIS LINE!
26
27+def quote(s):
28+ return '\"' + s + '\"'
29+
30+def fix_spaces_in_links(page):
31+ try:
32+ from BeautifulSoup import BeautifulSoup
33+ except ImportError:
34+ print "BeautifulSoup not installed: links with spaces will not work correctly!"
35+ return page
36+ soup = BeautifulSoup(page)
37+ for link in soup.findAll({'a': True, 'img': True}):
38+ try:
39+ link['href'] = urllib.unquote(link['href'])
40+ link['src'] = urllib.unquote(link['src'])
41+ except KeyError:
42+ pass
43+ return str(soup)
44
45 class PageLister(sgmllib.SGMLParser):
46 '''
47@@ -55,15 +73,20 @@
48 def reset(self):
49 sgmllib.SGMLParser.reset(self)
50 self.pages=[]
51-
52+
53 def start_param(self,attrs):
54 urlparam_flag=False
55 for key,value in attrs:
56 if key=='name' and value=='Local':
57 urlparam_flag=True
58 if urlparam_flag and key=='value':
59- self.pages.append('/'+value)
60-
61+ # self.pages.append('/'+value)
62+ # Avoid duplicates in the list of URLs.
63+ if not self.pages.count('/'+value):
64+ self.pages.append('/'+value)
65+
66+
67+
68 class ImageCatcher(sgmllib.SGMLParser):
69 '''
70 finds image urls in the current html page, so to take them out from the chm file.
71@@ -71,14 +94,14 @@
72 def reset(self):
73 sgmllib.SGMLParser.reset(self)
74 self.imgurls=[]
75-
76+
77 def start_img(self,attrs):
78 for key,value in attrs:
79 if key=='src' or key=='SRC':
80 # Avoid duplicates in the list of image URLs.
81 if not self.imgurls.count(value):
82 self.imgurls.append(value)
83-
84+
85 class CssCatcher(sgmllib.SGMLParser):
86 '''
87 finds CSS urls in the current html page, so to take them out from the chm file.
88@@ -86,7 +109,7 @@
89 def reset(self):
90 sgmllib.SGMLParser.reset(self)
91 self.cssurls=[]
92-
93+
94 def start_link(self,attrs):
95 for key,value in attrs:
96 if key=='href' or key=='HREF':
97@@ -100,23 +123,31 @@
98 (actually performed by the PageLister class)
99 '''
100 topicstree=cfile.GetTopicsTree()
101- lister=PageLister()
102- lister.feed(topicstree)
103- #print 'lister pages',lister.pages
104- return lister.pages
105-
106-def get_objective_urls_list(filename):
107+ if topicstree is not None:
108+ lister=PageLister()
109+ lister.feed(topicstree)
110+ #print 'lister pages',lister.pages
111+ return lister.pages
112+
113+ topicstree = get_objective_urls_list(cfile.filename, lambda s: s.endswith(('.htm', '.html')))
114+ if topicstree is None:
115+ raise RuntimeError('Html files not found inside chm file, nothing to convert!')
116+ return topicstree
117+
118+
119+def get_objective_urls_list(filename, cond = lambda x: True):
120 '''
121 takes the list of files inside the chm archive, with the correct urls of each one.
122 '''
123-
124- os.system('enum_chmLib '+filename+' > '+CHM2PDF_WORK_DIR+'/urlslist.txt')
125+ cmd = 'enum_chmLib '+quote(filename)+' > ' + quote(CHM2PDF_WORK_DIR) + '/urlslist.txt'
126+ os.system(cmd)
127 flist=open(CHM2PDF_WORK_DIR+'/urlslist.txt','rU')
128 urls_list=[]
129 for line in flist.readlines()[3:]:
130 #print 'line',line
131- spline=line.split()
132- urls_list.append(spline[5])
133+ spline = re.sub(r".*?normal file\s*(.*?)\n$", "\\1", line)
134+ if cond(spline) and spline[0]=="/":
135+ urls_list.append(spline)
136 flist.close()
137 # os.remove(CHM2PDF_WORK_DIR+'/urlslist.txt')
138
139@@ -129,33 +160,45 @@
140 pf=open(input_file,'rU')
141 page=pf.read()
142 pf.close()
143-
144+
145 # Correct the HTML markup of the page, if the --beautifulsoup was passed.
146 if options['beautifulsoup']=='--beautifulsoup':
147- from BeautifulSoup import BeautifulSoup, Tag
148+ try:
149+ from BeautifulSoup import BeautifulSoup
150+ except ImportError as e:
151+ print
152+ print '### An error occured importing soup ', e
153+ print '### Check if beautifulsoup is installed or remove --beautifulsoup from the command line'
154+ sys.exit()
155+
156 soup = BeautifulSoup(page)
157 page = str(soup)
158
159 image_catcher=ImageCatcher()
160 image_catcher.feed(page)
161-
162+
163 css_catcher=CssCatcher()
164 css_catcher.feed(page)
165-
166+
167 # We substitute the image URLs of input_file with the *actual* URLs on the CHM2PDF_ORIG_DIR directory
168 for iurl in image_catcher.imgurls:
169 # print 'iurl = ' + iurl
170
171 img_filename = ''
172 for item in objective_urls:
173- if iurl in item:
174+ #objective_urls has "real path", whereas image_catcher.imgurls can contain %20!
175+ #e.g. item='/doc space/image path/velocity space.gif iurl=image%20path/velocity%20space.gif
176+ iiurl= re.sub('%20',' ',iurl)
177+ if iiurl in item:
178 img_filename=CHM2PDF_ORIG_DIR+item
179 if ';' in img_filename: #hack to get rid of mysterious ; in filenames and urls...
180 img_filename=img_filename.split(';')[0]
181 # substitute the new image filenames - but only if an img_filename was found!
182+ # added (?i) modifier to make a case insensitive match for not breaking working links to images in windows in CHM files
183 if img_filename:
184- page=re.sub(iurl,img_filename,page)
185-
186+ #r = Python also has "raw strings" which do not apply special treatment to backslashes
187+ page=re.sub(r'(?i)"'+iurl,'"'+re.sub('\\\\ ', ' ', img_filename),page)
188+
189
190 # We substitute the CSS URLs of input_file with the *actual* URLs on the CHM2PDF_ORIG_DIR directory
191 for curl in css_catcher.cssurls:
192@@ -170,10 +213,10 @@
193 # substitute the new image filenames - but only if a css_filename was found!
194 if css_filename:
195 page=re.sub(curl,css_filename,page)
196-
197+
198 # Fontsize hack:
199 # Since htmldoc ignores the --fontsize option, we have to do something about it...
200- # If --fontsize xxx was given on the command line,
201+ # If --fontsize xxx was given on the command line,
202 # insert <font> and </font> tags between <p> and </p>.
203 # While doing so, use xxx as the value of the size attribute of the font tag.
204 if options['fontsize']:
205@@ -199,6 +242,10 @@
206 page=re.sub('"[^"]*prev\.gif"','""', page)
207 page=re.sub('"[^"]*next\.gif"','""', page)
208
209+ # Delete javascript (<script type='text/javascript'>...</script>)
210+ page=re.sub('(?i)<script[^>]*>(.*?)</script>','', page, flags=re.DOTALL|re.MULTILINE)
211+
212+
213 # Delete CSS markup (<link rel="stylesheet"...)
214 # Currently, htmldoc chokes on CSS. In some distant, bright future things will be different, but until then...
215 # I know, it is silly to try to correct the CSS URLs as above, only to delete them here, just a few lines later.
216@@ -299,29 +346,28 @@
217 # ########################### File extraction and correction: START ############################
218 #
219 if options['dontextract'] == '':
220-
221- try:
222- os.mkdir(CHM2PDF_TEMP_WORK_DIR)
223- except OSError: # The directory already exists.
224- pass
225-
226- try:
227- os.mkdir(CHM2PDF_TEMP_ORIG_DIR)
228- except OSError: # The directory already exists.
229- pass
230-
231- try:
232- os.mkdir(CHM2PDF_ORIG_DIR)
233- except OSError: # The directory already exists.
234- pass
235-
236- try:
237- os.mkdir(CHM2PDF_WORK_DIR)
238- except OSError: # The directory already exists.
239- pass
240-
241+ try:
242+ os.makedirs(CHM2PDF_TEMP_WORK_DIR)
243+ except OSError: # The directory already exists.
244+ pass
245+
246+ try:
247+ os.makedirs(CHM2PDF_TEMP_ORIG_DIR)
248+ except OSError: # The directory already exists.
249+ pass
250+
251+ try:
252+ os.makedirs(CHM2PDF_ORIG_DIR)
253+ except OSError: # The directory already exists.
254+ pass
255+
256+ try:
257+ os.makedirs(CHM2PDF_WORK_DIR)
258+ except OSError: # The directory already exists.
259+ pass
260+
261 # Compute filenames and lists. This is needed no matter if '--dontextract' was given or not!
262-
263+
264 html_list=get_html_list(cfile)
265 objective_urls=get_objective_urls_list(filename)
266
267@@ -334,15 +380,15 @@
268 # print html_list
269
270 true_html_list=[] #Should mostly coincide with html_list, but...
271-
272- input_titlefile = ''
273- output_titlefile = ''
274+
275+ input_titlefile = ''
276+ output_titlefile = ''
277 for html_file in html_list:
278 for item in objective_urls:
279 if html_file in item:
280 true_html_list.append(CHM2PDF_ORIG_DIR+item)
281 if not options['titlefile']=='' and options['titlefile'] in item:
282- input_titlefile = CHM2PDF_ORIG_DIR+item
283+ input_titlefile = CHM2PDF_ORIG_DIR+re.escape(item)
284 output_titlefile = CHM2PDF_WORK_DIR + os.sep + options['titlefile']
285
286 if not options['titlefile']=='' and not output_titlefile:
287@@ -354,13 +400,13 @@
288
289
290 # Process toc file. This depends on the '--dontextract' option.
291-
292+
293 if options['dontextract'] == '':
294 # Correct image links in toc file.
295 if not options['titlefile']=='' and os.path.exists(input_titlefile):
296 correct_file(input_titlefile, output_titlefile, html_list, objective_urls, options)
297
298-
299+
300 # Now process the rest of HTML files.
301
302 # Compute some lists. Again, this is independent of the '--dontextract' option.
303@@ -379,16 +425,16 @@
304 # Some names contain a '%20' (an HTML code for a space). We substitute with a "real space"
305 # otherwise a 'File not found' error will occur.
306 page_filename = re.sub('%20',' ',page_filename)
307-
308+
309 if options['verbose']=='--verbose' and options['verbositylevel']=='high' and options['dontextract'] == '':
310 print "Correcting " + page_filename
311
312-
313+
314 if os.path.exists(page_filename) and (options['titlefile'] == '' or not options['titlefile'] in url):
315 htmlout_filename=CHM2PDF_WORK_DIR+'/temp'+'%(#)04d' %{"#":c}+'.html'
316- htmlout_filename_list+=' '+ htmlout_filename
317+ htmlout_filename_list+=' '+ quote(htmlout_filename)
318 htmlout_filenames.append(htmlout_filename)
319-
320+
321 if options['dontextract'] == '':
322 # Correct image links in file page_filename.
323 correct_file(page_filename, htmlout_filename, html_list, objective_urls, options)
324@@ -397,6 +443,10 @@
325 url_filename_escaped = re.sub('/', '\/', os.path.basename(url))
326 # Escape dots in url.
327 url_filename_escaped = re.sub('\.', '\.', url_filename_escaped)
328+ # Escape ( in url.
329+ url_filename_escaped = re.sub('\(', '\(', url_filename_escaped)
330+ # Escape ) in url.
331+ url_filename_escaped = re.sub('\)', '\)', url_filename_escaped)
332 # Escape slashes in htmlout_filename.
333 htmlout_filename_escaped = re.sub('/', '\/', os.path.basename(htmlout_filename))
334 # Compute a "garbled" htmlout_filename, where dots are simply replaced with underscores.
335@@ -421,12 +471,12 @@
336 # tol.html -> temp0001.html -> temptemp0002.html -> temptemptemp0003.html ...
337 # 0001.html -> temp0002.html -> temptemp0003.html -> temptemptemp0004.html ...
338 # ...
339- #
340+ #
341 # which is not what we want.
342 match_strings.append(url_filename_escaped)
343 replace_strings.append(htmlout_filename_escaped)
344 replace_garbled_strings.append(htmlout_filename_escaped_garbled)
345-
346+
347 # Now we've got the lists computed. We proceed with the actual correction,
348 # which IS dependent on the '--dontextract' option:
349
350@@ -434,7 +484,7 @@
351 # Correct links to files in the local collection.
352 if options['verbose']=='--verbose' and options['verbositylevel']=='low':
353 print 'Correcting links in the HTML files...'
354-
355+
356 if options['verbose']=='--verbose' and options['verbositylevel']=='high':
357 print '############### 1st pass ###############'
358 for match_string in match_strings:
359@@ -443,7 +493,7 @@
360 print "match " + match_string + ' ' + "and replace it with " + replace_string
361 if options['verbose']=='--verbose' and options['verbositylevel']=='high':
362 print
363-
364+
365 if options['verbose']=='--verbose' and options['verbositylevel']=='high':
366 print '############### 2nd pass ###############'
367 for match_string in replace_garbled_strings:
368@@ -452,38 +502,51 @@
369 print "match " + match_string + ' ' + "and replace it with " + replace_string
370 if options['verbose']=='--verbose' and options['verbositylevel']=='high':
371 print
372-
373+
374 for filename in htmlout_filenames:
375-
376+
377 pf=open(filename,'rU')
378 page=pf.read()
379 pf.close()
380-
381+
382+ # Some names contain a '%20' (an HTML code for a space). We substitute with a "real space"
383+ # otherwise we won't be able to match to the real files.
384+ page = fix_spaces_in_links(page)
385+
386 # Substitutions in 1st pass: we replace the original filenames with their corresponding "garbled" equivalents.
387+ # added (?i) modifier to make a case insensitive match for not breaking working links on windows in CHM files
388+ # added " to the match criteria to avoid wrong match (eg this.htm matched also do_this.htm before)
389 for match_string in match_strings:
390 replace_string = replace_garbled_strings[match_strings.index(match_string)]
391- page = re.sub(match_string, replace_string, page)
392-
393-
394+ page = re.sub('(?i)"'+match_string, '"'+replace_string, page)
395+ #remove also path before..
396+ #eg: ..\other path\this.htm should match "this.htm" but not "do_this.htm"
397+ #this should match "matchstring or "some path\matchstring
398+ page = re.sub(r'(?i)("|"[^\/"].*?\/)'+match_string, '"'+replace_string, page)
399+ #what if in different paths we have files with the same name?
400+
401 # Substitutuions in the 2nd pass: we replace the garbled filenames with the correct ones.
402 for match_string in replace_garbled_strings:
403 replace_string = replace_strings[replace_garbled_strings.index(match_string)]
404 page = re.sub(match_string, replace_string, page)
405-
406- # Replace links of the form "somefile.html#894" with "somefile0206.html"
407+
408+ # Replace links of the form "somefile.html#894" with "somefile0206.html"
409 # The following will match anchors like '<a href="temp0206.html#894"' and will store the 'temp0206.html' in backreference 1.
410 # The replace string will then replace it with '<a href="temp0206.html"', i.e. it will take away the '#894' part.
411- # This is because the numbers after the '#' are often wrong or non-existent. It is better to link to an existing
412+ # This is because the numbers after the '#' are often wrong or non-existent. It is better to link to an existing
413 # chapter than to a non-existent part of an existing chapter.
414- page = re.sub('<a href="([^#]*)#[^"]*"', '<a href="\\1"', page)
415-
416+ # page = re.sub('(?i)<a href="([^#|"]*)#[^"]*"', '<a href="\\1"', page)
417+ # This leaves internal page links of the form "#.." intact
418+ page = re.sub('(?i)<a href="([^#|"]+)#[^"]*"', '<a href="\\1"', page)
419+
420 pf=open(filename,'w')
421 pf.write(page)
422- pf.close
423+ pf.flush()
424+ pf.close()
425
426 # Here ends the extraction and correction of the HTML files which, as said above,
427 # will take place ONLY IF '--dontextract' was NOT given.
428- # If '--dontextract' was given, only the file lists like htmlout_filename_list
429+ # If '--dontextract' was given, only the file lists like htmlout_filename_list
430 # were computed above, but no file extraction or correction took place.
431 #
432 # ########################### File extraction and correction: END ############################
433@@ -588,18 +651,19 @@
434 elif key=='user-password': htmldoc_opts += ' --user-password ' + value
435 elif key=='version': htmldoc_opts += ' ' + value
436 elif key=='webpage': htmldoc_opts += ' ' + value
437-
438+
439+ cmd = 'htmldoc' + htmldoc_opts + ' ' + htmlout_filename_list + " -f "+ quote(outputfilename) + " > /dev/null"
440 if options['verbose']=='--verbose' and options['verbositylevel']=='high':
441- print 'htmldoc' + htmldoc_opts + ' ' + htmlout_filename_list + " -f "+ outputfilename + " > /dev/null"
442- exit_value=os.system ('htmldoc' + htmldoc_opts + ' ' + htmlout_filename_list + " -f "+ outputfilename + " > /dev/null")
443+ print cmd
444+ exit_value=os.system (cmd)
445
446 if exit_value != 0:
447 print 'Something wrong happened when launching htmldoc.'
448 print 'exit value: ',exit_value
449 print 'Check if output exists or if it is good.'
450- else:
451+ else:
452 print 'Written file ' + outputfilename
453- print 'Done.'
454+ print 'Done.'
455
456 def usage (name):
457 print 'Usage:'
458@@ -832,7 +896,7 @@
459 options['webpage'] = ''
460
461 try:
462- opts, args = getopt.getopt(sys.argv[1:], "f:t:v:",
463+ opts, args = getopt.getopt(sys.argv[1:], "f:t:v:",
464 [
465 "beautifulsoup",
466 "bodycolor=",
467@@ -930,7 +994,7 @@
468 except getopt.GetoptError:
469 usage(sys.argv[0])
470 sys.exit(1)
471-
472+
473 for o, a in opts:
474 if o == '--beautifulsoup': options['beautifulsoup'] = '--beautifulsoup'
475 elif o == '--bodycolor': options['bodycolor'] = a
476@@ -1022,7 +1086,7 @@
477 elif o == '--user-password': options['user-password'] = a
478 elif o in ('-v', '--verbose'): options['verbose'] = '--verbose'
479 elif o == '--verbositylevel': options['verbositylevel'] = a
480- elif o == '--version':
481+ elif o == '--version':
482 print sys.argv[0] + ' version ' + version
483 print 'This is free software; see the source for copying conditions. There is NO'
484 print 'warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.'
485@@ -1041,7 +1105,7 @@
486 return
487 #
488 # One of '--book' or '--webpage' MUST be given!
489- if options['extract-only'] == '' and ((options['book'] == '' and options['webpage'] == '' and options['continuous'] == '') or
490+ if options['extract-only'] == '' and ((options['book'] == '' and options['webpage'] == '' and options['continuous'] == '') or
491 (options['book'] == '--book' and options['webpage'] == '--webpage') or
492 (options['book'] == '--book' and options['continuous'] == '--continuous') or
493 (options['webpage'] == '--webpage' and options['continuous'] == '--continuous')):
494@@ -1057,7 +1121,7 @@
495 return
496 elif len(args)==1:
497 filename = args[0]
498- dirname, basename, suffix = split(filename)
499+ dirname, basename, suffix = split(filename)
500 if dirname:
501 outputfilename = dirname + os.sep + basename +'.pdf'
502 else:
503@@ -1072,7 +1136,7 @@
504 else:
505 usage(sys.argv[0])
506 return
507-
508+
509 CHM2PDF_WORK_DIR = CHM2PDF_TEMP_WORK_DIR + os.sep + basename
510 CHM2PDF_ORIG_DIR = CHM2PDF_TEMP_ORIG_DIR + os.sep + basename
511
512@@ -1083,14 +1147,16 @@
513 if not os.path.exists(filename):
514 print 'CHM file "' + filename + '" not found!'
515 return
516-
517+
518 #remove temporary files
519 if options['dontextract'] == '':
520 if options['verbose']=='--verbose' and options['verbositylevel']=='high':
521 print 'Removing any previous temporary files...'
522- os.system('rm -r '+CHM2PDF_ORIG_DIR+'/*')
523- os.system('rm -r '+CHM2PDF_WORK_DIR+'/*')
524-
525+ try:
526+ os.rmdir(CHM2PDF_ORIG_DIR)
527+ os.rmdir(CHM2PDF_WORK_DIR)
528+ except OSError:
529+ pass
530 cfile = chm.CHMFile()
531 cfile.LoadCHM(filename)
532
533@@ -1100,13 +1166,13 @@
534 print 'Will use the files in ' + CHM2PDF_ORIG_DIR + ' and ' + CHM2PDF_WORK_DIR + '.'
535 else:
536 if options['verbose'] == '--verbose' and options['verbositylevel'] == 'high':
537- os.system('extract_chmLib ' + filename + ' ' + CHM2PDF_ORIG_DIR)
538+ os.system('extract_chmLib ' + quote(filename) + ' ' + quote(CHM2PDF_ORIG_DIR))
539 else:
540- os.system('extract_chmLib ' + filename + ' ' + CHM2PDF_ORIG_DIR + '&> /dev/null')
541-
542+ os.system('extract_chmLib ' + quote(filename) + ' ' + quote(CHM2PDF_ORIG_DIR) + '&> /dev/null')
543+
544 convert_to_pdf(cfile, filename, outputfilename, options)
545
546
547 if __name__ == '__main__':
548 main(sys.argv)
549-
550+
551
552=== modified file 'debian/README.source' (properties changed: -x to +x)
553=== modified file 'debian/changelog' (properties changed: -x to +x)
554=== modified file 'debian/chm2pdf.1' (properties changed: -x to +x)
555=== modified file 'debian/chm2pdf.manpages' (properties changed: -x to +x)
556=== modified file 'debian/compat' (properties changed: -x to +x)
557=== modified file 'debian/control' (properties changed: -x to +x)
558=== modified file 'debian/copyright' (properties changed: -x to +x)
559=== modified file 'debian/patches/multi_filename_fix.diff' (properties changed: -x to +x)
560=== modified file 'debian/patches/series' (properties changed: -x to +x)
561=== modified file 'debian/pycompat' (properties changed: -x to +x)
562=== modified file 'debian/pyversions' (properties changed: -x to +x)
563=== modified file 'debian/watch' (properties changed: -x to +x)
564=== modified file 'setup.py' (properties changed: -x to +x)
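
One detail in fix_spaces_in_links in the diff above may deserve a second look: href and src are unquoted inside the same try block, so for a tag that has src but no href (a typical img), the KeyError raised by link['href'] would appear to skip the src line as well. The sketch below shows the attributes being handled independently. It is written for Python 3, where urllib.unquote lives at urllib.parse.unquote, and it uses plain dicts as stand-ins for BeautifulSoup tags; unquote_link_attrs is a hypothetical helper, not code from the branch.

```python
from urllib.parse import unquote


def unquote_link_attrs(attrs):
    """Decode %XX escapes in 'href' and 'src', each handled on its own.

    attrs is a plain dict standing in for a tag's attribute mapping.
    """
    for key in ('href', 'src'):
        if key in attrs:
            attrs[key] = unquote(attrs[key])
    return attrs


# An img-like tag (src only) and an a-like tag (href only).
img = unquote_link_attrs({'src': 'image%20path/velocity%20space.gif'})
link = unquote_link_attrs({'href': 'doc%20space/page.htm'})
print(img['src'])     # image path/velocity space.gif
print(link['href'])   # doc space/page.htm
```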
