chm2pdf

Merge lp:~grishkin/chm2pdf/chm2pdf_branch into lp:~reto-knaak/chm2pdf/chm2pdf_branch

Proposed by Max Grishkin on 2012-10-07

Status: Needs review
Proposed branch: lp:~grishkin/chm2pdf/chm2pdf_branch
Merge into: lp:~reto-knaak/chm2pdf/chm2pdf_branch
Diff against target: 564 lines (+160/-94), 1 file modified: chm2pdf (+160/-94)
To merge this branch: bzr merge lp:~grishkin/chm2pdf/chm2pdf_branch

Reviewer	Review Type	Date Requested	Status
Reto Knaak		2012-10-07	Pending
Review via email: mp+128385@code.launchpad.net

Description of the change

Mainly fix from https://groups.google.com/d/topic/chm2pdf/SeOGMcMFsBw/discussion.

Reto Knaak (reto-knaak) wrote on 2012-10-15:

Hi Grishkin (Max?) !

Thank you for the files... I'm not a real programmer (I just tried to fix
some issues that were preventing me from using the script), so I don't know
if I am the right person to do the code review.
It has been many months since I last booted my VirtualBox with Ubuntu, and
I have the feeling I forgot most of what I learned while trying to fix the
script.

Anyway, this evening I had some time and began to download the files - just
to see what's going on.

My operating system is Win7, and if I open the CHM file from Windows it
won't open, probably due to the missing TOC!
Then I tried to import it into calibre, and if I open the CHM there,
something is displayed, but it is the "Liberty Bay" article from the
online(!) Wikipedia.
If I convert the file to another format (mobi), I get a page with "1951
Chicago Bears season", which seems to me the right output.
So I'm not sure the demo CHM file has a valid output, but I agree that it's
a good idea to try to extract what's there.

I'm not familiar with the code review process, and I am asking myself
whether I/we should open a bug under Ubuntu (that is what I did with the
bugs I found previously)?

I took a quick look at the diff; most of the changes are not real
differences (probably just whitespace), and the only true differences are
in def get_html_list(cfile) and def get_objective_urls_list(filename).

For the first one, it's the first time I have seen "lambda" (so again,
probably I'm not the right one to review...).
I think I understand what it's meant for, but I can't say I understand how
it works... (if the first way of retrieving the HTML files fails, use the
second one, which takes all files found, or something similar?)
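
To illustrate the "lambda" question, here is a minimal sketch of the pattern: a function takes an optional test (a predicate) and the caller passes a small unnamed function to narrow the result. It is written for Python 3, and the name filter_urls and the sample paths are hypothetical, not taken from the script.

```python
def filter_urls(urls, cond=lambda url: True):
    """Return the URLs for which cond(url) is true; the default keeps all."""
    return [url for url in urls if cond(url)]

# Hypothetical internal paths, one of them containing a space.
urls = ["/index.html", "/img/logo.gif", "/doc space/page.htm"]

# No predicate given: the default lambda accepts everything.
everything = filter_urls(urls)

# A lambda is just an unnamed one-expression function; this one keeps
# only names ending in .htm or .html.
html_only = filter_urls(urls, lambda s: s.endswith((".htm", ".html")))

print(everything)
print(html_only)
```

With a default predicate, the same listing function can serve both callers: one that wants every file and one that wants only the HTML pages as a fallback when there is no table of contents.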

For the second one, "my local chm2pdf" is like this:

def get_objective_urls_list(filename):
    '''
    takes the list of files inside the chm archive, with the correct urls of each one.
    '''
    os.system('enum_chmLib '+filename+' > "'+CHM2PDF_WORK_DIR+'/urlslist.txt"')
    flist=open(CHM2PDF_WORK_DIR+'/urlslist.txt','rU')
    urls_list=[]
    for line in flist.readlines()[3:]:
        #print 'line',line
        #This won't work if internal paths of CHM contains spaces: e.g. /doc space/ will only become /doc
        #spline=line.split()
        #urls_list.append(spline[5])
        #this should work better:
        spline= re.sub(r".*?normal file\s*(.*?)\n$", "\\1", line)
        if spline[0]=="/":
          #print "got spline="+spline
          urls_list.append( spline)
    flist.close()
    # os.remove(CHM2PDF_WORK_DIR+'/urlslist.txt')
    return urls_list

Does your solution work with CHM paths containing spaces? (If you need a
sample file, see
https://bugs.launchpad.net/ubuntu/+source/chm2pdf/+bug/894193 )
I have the feeling (I haven't actually run any scripts this evening, and I
am forgetting Python) that using urls_list.append(spline[5]) will fail for
paths with spaces!
I also have the feeling that my solution is not really state of the art, so
maybe you can suggest something that solves both problems?
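
The difference between the two parsing approaches can be sketched as follows. The sample line is hypothetical: it only assumes that a listing line has three numeric columns, the type "normal file", and then the path, which is what spline[5] and the regular expression above both imply. The real enum_chmLib column layout may differ. The code is written for Python 3.

```python
import re

# Hypothetical listing line; the exact column layout is an assumption.
SAMPLE = "   1     4096     2048   normal file\t\t/doc space/index.html\n"

def path_by_split(line):
    """Old approach: take the sixth whitespace-separated field."""
    return line.split()[5]

def path_by_regex(line):
    """Regex approach: keep everything after the 'normal file' type column."""
    match = re.match(r".*?normal file\s*(.*?)\n$", line)
    return match.group(1) if match else None

print(repr(path_by_split(SAMPLE)))   # path is cut at its first space
print(repr(path_by_regex(SAMPLE)))   # whole path survives
```

Returning None for lines that do not match makes the "not a normal file" case explicit, instead of relying on a later check of the first character.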

Hope to hear from you soon, and kind regards from the Italian part of Switzerland!

Ciao
Reto Knaak

On Sun, Oct 7, 2012 at 5:23 PM, Grishkin <MGrishkin@gmail.com> wrote:

> Grishkin has proposed merging lp:~grishkin/chm2pdf/chm2pdf_branch into
> lp:~reto-knaak/chm2pdf/chm2pdf_branch.
>
> Requested reviews:
>   Reto Knaak (reto-knaak)
>
> For more details, see:
> https://code.launchpad.net/~grishkin/chm2pdf/chm2pdf_branch/+merge/128385
>
> Mainly fix from
> https://groups.google.com/d/topic/chm2pdf/SeOGMcMFsBw/discussion.
> --
> https://code.launchpad.net/~grishkin/chm2pdf/chm2pdf_branch/+merge/128385
> You are requested to review the proposed merge of
> lp:~grishkin/chm2pdf/chm2pdf_branch into
> lp:~reto-knaak/chm2pdf/chm2pdf_branch.
>

lp:~grishkin/chm2pdf/chm2pdf_branch updated on 2012-10-19

14. By Grishkin <grishkin@mint> on 2012-10-19: Merged Reto's pathes, fixed problems with spaces

Max Grishkin (grishkin) wrote on 2012-10-19:

Hello, Reto, and thanks for the constructive response!

I have been watching the activity in the chm2pdf Google Group and on
Launchpad for a while, and I understand that you are just an ordinary user
of chm2pdf, not a maintainer or author of the software. But I see that
chm2pdf was published quite a long time ago and there have been no bugfixes
or improvements since, so the project may be considered abandoned. Some
time ago you were the most active participant in the project discussion,
and you own a branch of it on LP. When I found this branch, I decided to
upload my chm2pdf version to Launchpad too, just to make it public. My
point is that our branches should be synchronized and both should have the
latest chm2pdf version with all available fixes.

Concerning the chm file from the Google Group topic: it is certainly very
dirty, and someone may argue that chm2pdf should not be expected to process
such files correctly. But it was created just to demonstrate the type of
files on which chm2pdf failed before; now it generates PDFs from them. I
just put two completely random Wikipedia articles into one chm file, and
they do not even link to each other; that's why some software shows one
page and other software shows a completely different page. Anyway, the
resulting PDF contains both pages.

I downloaded your patch for spaces in names, but I did not understand
which version to apply it to: the one from the distribution, the one from
code.google.com, or the one from your branch? I tried to merge it into
different versions by hand, but the resulting files still failed on the chm
file from your Demo_CMH.zip. So I merged your patch into my branch and
fixed the rest until the conversion started to work for me, mainly by
adding quotes around filenames where needed. It seems that the
no-table-of-contents and spaces-in-filenames fixes work perfectly together!
I did not try to solve the problem with '%20' symbols, but I'll think about
it shortly; it does not seem to be a difficult problem.
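
As a side note on "adding quotes around filenames": wrapping a name in double quotes covers spaces, but a name that itself contains a quote character would still break the command. A possible alternative, sketched here for Python 3 with the standard shlex module, is to let the library do the quoting. The helper name enum_command and both sample paths are hypothetical.

```python
import shlex

def enum_command(chm_path, listing_path):
    """Build the shell command string with both paths safely quoted."""
    return "enum_chmLib %s > %s" % (
        shlex.quote(chm_path),
        shlex.quote(listing_path),
    )

# Hypothetical paths containing spaces and double quotes.
chm_path = 'My "demo" file.chm'
listing_path = "/tmp/chm2pdf/work dir/urlslist.txt"

cmd = enum_command(chm_path, listing_path)
print(cmd)

# shlex.split parses the string the way a POSIX shell would tokenize it,
# so it can be used to check that each path stays a single argument.
print(shlex.split(cmd))
```

The string is only built and inspected here, not executed, so the sketch does not depend on enum_chmLib being installed.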

Reto, please also note that since you reply through @code.launchpad.net,
your reply will be publicly available at
https://code.launchpad.net/~grishkin/chm2pdf/chm2pdf_branch/+merge/128385.
That's definitely not a problem; I am just pointing this out in case you
have not noticed.
-----
Best regards,
Grishkin Maxim

Reto Knaak (reto-knaak) wrote on 2012-10-23:

Hi Max!

You're welcome and thank you!
It's true that the project is abandoned... a pity, because it does a
wonderful job for me!
Thank you for pointing out that my replies are public; this is OK for me
(I guessed it).

I think it's a good idea to share your patch, and I also suggest filing
the bug officially (with a link to the Google Code bug page).
I also agree to merge the branches, but I'll need some help...

In the meantime I updated my Ubuntu system and tried your chm example file,
and as expected an error occurred...

I started from the Ubuntu version and made my branch using this sequence:

patch chm2pdf < ../patches/chm2pdf_check_soup.diff
patch chm2pdf < ../patches/chm2pdf_no_javascript.diff
patch chm2pdf < ../patches/chm2pdf_multiple_page_problem.diff
patch chm2pdf < ../patches/chm2pdf_color_removed.diff
patch chm2pdf < ../patches/chm2pdf_links_case_insensitive.diff
patch chm2pdf < ../patches/chm2pdf_images_case_insensitive.diff
patch chm2pdf < ../patches/chm2pdf_specialchars.diff

You probably need to apply all the patches.

Maybe it's also a good idea for you to take a critical look at my patches;
chm2pdf is my first and only experience with Python (and Linux), and some
of my solutions may not be too clean...

Kind regards
Reto

lp:~grishkin/chm2pdf/chm2pdf_branch updated on 2012-11-04

15. By Grishkin <grishkin@mint> on 2012-10-27: merged several other Reto's patches
16. By Grishkin <grishkin@mint> on 2012-11-04: Fixed processing of links with spaces

Max Grishkin (grishkin) wrote on 2012-11-04:

Hi Reto!
I've reviewed your branch at
https://code.launchpad.net/~reto-knaak/chm2pdf/chm2pdf_branch and it seems
that it does not contain any of the patches you mentioned, just changes in
the changelog file. So I've merged these patches into my branch manually. I
have not tested them thoroughly yet, and I am not especially confident that
the last two patches were merged correctly.
It seems that I've found the solution to the problem you described at
https://bugs.launchpad.net/ubuntu/+source/chm2pdf/+bug/894193; check the
latest comment there.
To merge the branches you'll have to install bzr on your local PC and
download your branch (if it is not already present there) by running:
bzr branch lp:~reto-knaak/chm2pdf/chm2pdf_branch
After that you merge branches by running:
bzr merge lp:~grishkin/chm2pdf/chm2pdf_branch
And upload the results back to launchpad:
bzr push lp:~reto-knaak/chm2pdf/chm2pdf_branch

Please feel free to contact me if you have any questions or if something is
not working.

-----
Best regards,
Grishkin Maxim
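
The idea behind the fix for links with spaces (revision 16) is to decode percent-escapes such as %20 in link targets so that they match the real file names. The sketch below is a simplified stand-in written for Python 3: it uses a regular expression instead of the BeautifulSoup parser used in the branch, and it assumes double-quoted attribute values. The function name unquote_links and the sample markup are hypothetical.

```python
import re
from urllib.parse import unquote  # Python 2 spelled this urllib.unquote

# Matches href="..." or src="..." in any letter case.
# Assumption: attribute values are always double-quoted.
_LINK_ATTR = re.compile(r'(?i)\b(href|src)="([^"]*)"')

def unquote_links(page):
    """Decode percent-escapes (e.g. %20) inside href and src values."""
    def decode(match):
        return '%s="%s"' % (match.group(1), unquote(match.group(2)))
    return _LINK_ATTR.sub(decode, page)

# Hypothetical markup with escaped spaces in both kinds of attribute.
sample = ('<a href="doc%20space/a.html">'
          '<img SRC="image%20path/velocity%20space.gif"></a>')
fixed = unquote_links(sample)
print(fixed)
```

Each attribute is handled on its own here, so an img tag without an href still gets its src decoded.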

Reto Knaak (reto-knaak) wrote on 2012-11-04:

Hi Grishkin!

Nice to hear from you, and thank you for your work!
It's possible that I made some mistakes; I also saw that the patches are
not there, but assumed (as they are listed in /debian/patches/series) that
they would be applied at installation.

I started my VirtualBox and executed the steps you described.
Unfortunately, the merge step gives me
bzr: ERROR: Not a branch: "/home/reto/".
So I'm stuck again...
Nevertheless, I checked your code against mine, and there are a few points
to discuss.

- In your version you don't use temporary directories (like the original
script version on Google Code); in the Ubuntu/Debian version temporary
directories were introduced for security reasons (against the will of the
developers, with the result that they left the project). As this is a
branch on Ubuntu, it would probably be good to stay with the temporary
directories solution.
Personally I like --dontextract because it's useful in debugging, and it is
broken with temporary directories...

- Thank you for the suggestion for solving the %20 issue.
My suggestion is to use your solution only if --beautifulsoup is used and
otherwise to stay with the current solution (it's only a minor problem for
me).

- I'm not sure whether line 522 should be commented out, as the problem is
now solved by line 526:
522: page = re.sub('(?i)"'+match_string, '"'+replace_string, page)
526: page = re.sub(r'(?i)("|"[^\/"].*?\/)'+match_string, '"'+replace_string, page)

- What is the difference between os.mkdir and os.makedirs? Am I right that
os.makedirs is safer to use?
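
On the last question, a minimal sketch (Python 3) of how the two calls differ: os.mkdir creates only the final directory and fails when a parent is missing, while os.makedirs also creates the missing parents. The scratch location comes from tempfile.mkdtemp, which is also one way to get a private, uniquely named temporary directory; the prefix and the nested path below are illustrative only.

```python
import os
import shutil
import tempfile

# A unique scratch directory that only the current user can access.
base = tempfile.mkdtemp(prefix="chm2pdf-demo-")
nested = os.path.join(base, "chm2pdf", "work")

try:
    os.mkdir(nested)          # fails: the parent "chm2pdf" does not exist
    mkdir_worked = True
except OSError:
    mkdir_worked = False

os.makedirs(nested)                  # creates "chm2pdf" and then "work"
os.makedirs(nested, exist_ok=True)   # Python 3.2+: fine if already present
created = os.path.isdir(nested)

print(mkdir_worked, created)
shutil.rmtree(base)                  # clean up the scratch directory
```

So os.makedirs is not so much safer as more convenient for nested paths such as /tmp/chm2pdf/work. Note that a bare "except OSError: pass" around either call hides every failure, including permission errors, and not only the "already exists" case.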

I still hope that someone will take charge of maintaining the chm2pdf
project, so that our shared efforts are not lost....
Kind regards!

Reto

Unmerged revisions

16. By Grishkin <grishkin@mint> on 2012-11-04: Fixed processing of links with spaces
15. By Grishkin <grishkin@mint> on 2012-10-27: merged several other Reto's patches
14. By Grishkin <grishkin@mint> on 2012-10-19: Merged Reto's pathes, fixed problems with spaces
13. By Grishkin <grishkin@mint> on 2012-10-07: Fixed processing of chm's without table of contents and fixed temp directory creation

Preview Diff

 === modified file 'LICENSE' (properties changed: -x to +x)
 === modified file 'PKG-INFO' (properties changed: -x to +x)
 === modified file 'README' (properties changed: -x to +x)
 === modified file 'chm2pdf' (properties changed: -x to +x)
 --- chm2pdf	2008-08-05 19:39:01 +0000
 +++ chm2pdf	2012-11-04 13:57:19 +0000
@@ -28,6 +28,7 @@
  import re, glob
  import getopt
  # from BeautifulSoup import BeautifulSoup
++import urllib
  global version
@@ -39,13 +40,30 @@
  global filename #the input filename
  version = '0.9.1'
--CHM2PDF_TEMP_WORK_DIR='/tmp/chm2pdf/work'
++CHM2PDF_TEMP_WORK_DIR='/tmp/chm2pdf/work'
  CHM2PDF_TEMP_ORIG_DIR='/tmp/chm2pdf/orig'
  # YOU DON'T NEED TO CHANGE ANYTHING BELOW THIS LINE!
++def quote(s):
++    return '\"' + s + '\"'
++
++def fix_spaces_in_links(page):
++    try:
++        from BeautifulSoup import BeautifulSoup
++    except ImportError:
++        print "BeautifulSoup not installed: links with spaces will not work correctly!"
++        return page
++    soup = BeautifulSoup(page)
++    for link in soup.findAll({'a': True, 'img': True}):
++        try:
++            link['href'] =  urllib.unquote(link['href'])
++            link['src'] = urllib.unquote(link['src'])
++        except KeyError:
++            pass
++    return str(soup)
  class PageLister(sgmllib.SGMLParser):
      '''
@@ -55,15 +73,20 @@
      def reset(self):
          sgmllib.SGMLParser.reset(self)
          self.pages=[]
--
++
      def start_param(self,attrs):
         urlparam_flag=False
         for key,value in attrs:
             if key=='name' and value=='Local':
                 urlparam_flag=True
             if urlparam_flag and key=='value':
--               self.pages.append('/'+value)
--
++               # self.pages.append('/'+value)
++               # Avoid duplicates in the list of URLs.
++               if not self.pages.count('/'+value):
++                   self.pages.append('/'+value)
++
++
++
  class ImageCatcher(sgmllib.SGMLParser):
      '''
      finds image urls in the current html page, so to take them out from the chm file.
@@ -71,14 +94,14 @@
      def reset(self):
          sgmllib.SGMLParser.reset(self)
          self.imgurls=[]
--
++
      def start_img(self,attrs):
          for key,value in attrs:
              if key=='src' or key=='SRC':
                  # Avoid duplicates in the list of image URLs.
                  if not self.imgurls.count(value):
                      self.imgurls.append(value)
--
++
  class CssCatcher(sgmllib.SGMLParser):
      '''
      finds CSS urls in the current html page, so to take them out from the chm file.
@@ -86,7 +109,7 @@
      def reset(self):
          sgmllib.SGMLParser.reset(self)
          self.cssurls=[]
--
++
      def start_link(self,attrs):
          for key,value in attrs:
              if key=='href' or key=='HREF':
@@ -100,23 +123,31 @@
      (actually performed by the PageLister class)
      '''
      topicstree=cfile.GetTopicsTree()
--    lister=PageLister()
--    lister.feed(topicstree)
--    #print 'lister pages',lister.pages
--    return lister.pages
--
--def get_objective_urls_list(filename):
++    if topicstree is not None:
++        lister=PageLister()
++        lister.feed(topicstree)
++        #print 'lister pages',lister.pages
++        return lister.pages
++
++    topicstree = get_objective_urls_list(cfile.filename, lambda s: s.endswith(('.htm', '.html')))
++    if topicstree is None:
++        raise RuntimeError('Html files not found inside chm file, nothing to convert!')
++    return topicstree
++
++
++def get_objective_urls_list(filename, cond = lambda x: True):
      '''
      takes the list of files inside the chm archive, with the correct urls of each one.
      '''
--
--    os.system('enum_chmLib '+filename+' > '+CHM2PDF_WORK_DIR+'/urlslist.txt')
++    cmd = 'enum_chmLib '+quote(filename)+' > ' + quote(CHM2PDF_WORK_DIR) + '/urlslist.txt'
++    os.system(cmd)
      flist=open(CHM2PDF_WORK_DIR+'/urlslist.txt','rU')
      urls_list=[]
      for line in flist.readlines()[3:]:
          #print 'line',line
--        spline=line.split()
--        urls_list.append(spline[5])
++        spline = re.sub(r".*?normal file\s*(.*?)\n$", "\\1", line)
++        if cond(spline) and spline[0]=="/":
++            urls_list.append(spline)
      flist.close()
      # os.remove(CHM2PDF_WORK_DIR+'/urlslist.txt')
@@ -129,33 +160,45 @@
      pf=open(input_file,'rU')
      page=pf.read()
      pf.close()
--
++
      # Correct the HTML markup of the page, if the --beautifulsoup was passed.
      if options['beautifulsoup']=='--beautifulsoup':
--        from BeautifulSoup import BeautifulSoup, Tag
++        try:
++          from BeautifulSoup import BeautifulSoup
++        except ImportError as e:
++          print
++          print '### An error occured importing soup ', e
++          print '### Check if beautifulsoup is installed or remove --beautifulsoup from the command line'
++          sys.exit()
++
          soup = BeautifulSoup(page)
          page = str(soup)
      image_catcher=ImageCatcher()
      image_catcher.feed(page)
--
++
      css_catcher=CssCatcher()
      css_catcher.feed(page)
--
++
      # We substitute the image URLs of input_file with the *actual* URLs on the CHM2PDF_ORIG_DIR directory
      for iurl in image_catcher.imgurls:
          # print 'iurl = '  + iurl
          img_filename = ''
          for item in objective_urls:
--            if iurl in item:
++            #objective_urls has "real path", whereas image_catcher.imgurls can contain %20!
++            #e.g. item='/doc space/image path/velocity space.gif  iurl=image%20path/velocity%20space.gif
++            iiurl= re.sub('%20',' ',iurl)
++            if iiurl in item:
                  img_filename=CHM2PDF_ORIG_DIR+item
                  if ';' in img_filename: #hack to get rid of mysterious ; in filenames and urls...
                      img_filename=img_filename.split(';')[0]
          # substitute the new image filenames - but only if an img_filename was found!
++        # added (?i) modifier to make a case insensitive match for not breaking working links to images in windows in CHM files
          if img_filename:
--            page=re.sub(iurl,img_filename,page)
--
++            #r = Python also has "raw strings" which do not apply special treatment to backslashes
++            page=re.sub(r'(?i)"'+iurl,'"'+re.sub('\\\\ ', ' ', img_filename),page)
++
      # We substitute the CSS URLs of input_file with the *actual* URLs on the CHM2PDF_ORIG_DIR directory
      for curl in css_catcher.cssurls:
@@ -170,10 +213,10 @@
          # substitute the new image filenames - but only if a css_filename was found!
          if css_filename:
              page=re.sub(curl,css_filename,page)
--
++
      # Fontsize hack:
      # Since htmldoc ignores the --fontsize option, we have to do something about it...
--    # If --fontsize xxx was given on the command line,
++    # If --fontsize xxx was given on the command line,
      # insert <font> and </font> tags between <p> and </p>.
      # While doing so, use xxx as the value of the size attribute of the font tag.
      if options['fontsize']:
@@ -199,6 +242,10 @@
      page=re.sub('"[^"]*prev\.gif"','""', page)
      page=re.sub('"[^"]*next\.gif"','""', page)
++    # Delete javascript (<script type='text/javascript'>...</script>)
++    page=re.sub('(?i)<script[^>]*>(.*?)</script>','', page, flags=re.DOTALL|re.MULTILINE)
++
++
      # Delete CSS markup (<link rel="stylesheet"...)
      # Currently, htmldoc chokes on CSS. In some distant, bright future things will be different, but until then...
      # I know, it is silly to try to correct the CSS URLs as above, only to delete them here, just a few lines later.
@@ -299,29 +346,28 @@
      # ########################### File extraction and correction: START ############################
+     #
      if options['dontextract'] == '':
--
--        try:
--            os.mkdir(CHM2PDF_TEMP_WORK_DIR)
--        except OSError: # The directory already exists.
--            pass
--
--        try:
--            os.mkdir(CHM2PDF_TEMP_ORIG_DIR)
--        except OSError: # The directory already exists.
--            pass
--
--        try:
--            os.mkdir(CHM2PDF_ORIG_DIR)
--        except OSError: # The directory already exists.
--            pass
--
--        try:
--            os.mkdir(CHM2PDF_WORK_DIR)
--        except OSError: # The directory already exists.
--            pass
--
++        try:
++            os.makedirs(CHM2PDF_TEMP_WORK_DIR)
++        except OSError: # The directory already exists.
++            pass
++
++        try:
++            os.makedirs(CHM2PDF_TEMP_ORIG_DIR)
++        except OSError: # The directory already exists.
++            pass
++
++        try:
++            os.makedirs(CHM2PDF_ORIG_DIR)
++        except OSError: # The directory already exists.
++            pass
++
++        try:
++            os.makedirs(CHM2PDF_WORK_DIR)
++        except OSError: # The directory already exists.
++            pass
++
      # Compute filenames and lists. This is needed no matter if '--dontextract' was given or not!
--
++
      html_list=get_html_list(cfile)
      objective_urls=get_objective_urls_list(filename)
@@ -334,15 +380,15 @@
      # print html_list
      true_html_list=[] #Should mostly coincide with html_list, but...
--
--    input_titlefile = ''
--    output_titlefile = ''
++
++    input_titlefile = ''
++    output_titlefile = ''
      for html_file in html_list:
          for item in objective_urls:
              if html_file in item:
                  true_html_list.append(CHM2PDF_ORIG_DIR+item)
              if not options['titlefile']=='' and options['titlefile'] in item:
--                input_titlefile = CHM2PDF_ORIG_DIR+item
++                input_titlefile = CHM2PDF_ORIG_DIR+re.escape(item)
                  output_titlefile = CHM2PDF_WORK_DIR + os.sep + options['titlefile']
      if not options['titlefile']=='' and not output_titlefile:
@@ -354,13 +400,13 @@
      # Process toc file. This depends on the '--dontextract' option.
--
++
      if options['dontextract'] == '':
          # Correct image links in toc file.
          if not options['titlefile']=='' and os.path.exists(input_titlefile):
              correct_file(input_titlefile, output_titlefile, html_list, objective_urls, options)
--
++
      # Now process the rest of HTML files.
      # Compute some lists. Again, this is independent of the '--dontextract' option.
@@ -379,16 +425,16 @@
          # Some names contain a '%20' (an HTML code for a space). We substitute with a "real space"
          # otherwise a 'File not found' error will occur.
          page_filename = re.sub('%20',' ',page_filename)
--
++
          if options['verbose']=='--verbose' and options['verbositylevel']=='high' and options['dontextract'] == '':
              print "Correcting " + page_filename
--
++
          if os.path.exists(page_filename) and (options['titlefile'] == '' or not options['titlefile'] in url):
              htmlout_filename=CHM2PDF_WORK_DIR+'/temp'+'%(#)04d' %{"#":c}+'.html'
--            htmlout_filename_list+=' '+ htmlout_filename
++            htmlout_filename_list+=' '+ quote(htmlout_filename)
              htmlout_filenames.append(htmlout_filename)
--
++
              if options['dontextract'] == '':
                  # Correct image links in file page_filename.
                  correct_file(page_filename, htmlout_filename, html_list, objective_urls, options)
@@ -397,6 +443,10 @@
              url_filename_escaped = re.sub('/', '\/', os.path.basename(url))
              # Escape dots in url.
              url_filename_escaped = re.sub('\.', '\.', url_filename_escaped)
++            # Escape ( in url.
++            url_filename_escaped = re.sub('\(', '\(', url_filename_escaped)
++            # Escape ) in url.
++            url_filename_escaped = re.sub('\)', '\)', url_filename_escaped)
              # Escape slashes in htmlout_filename.
              htmlout_filename_escaped = re.sub('/', '\/', os.path.basename(htmlout_filename))
              # Compute a "garbled" htmlout_filename, where dots are simply replaced with underscores.
@@ -421,12 +471,12 @@
              # tol.html	-> temp0001.html -> temptemp0002.html -> temptemptemp0003.html ...
              # 0001.html	-> temp0002.html -> temptemp0003.html -> temptemptemp0004.html ...
              # ...
--            #
++            #
              # which is not what we want.
              match_strings.append(url_filename_escaped)
              replace_strings.append(htmlout_filename_escaped)
              replace_garbled_strings.append(htmlout_filename_escaped_garbled)
--
++
      # Now we've got the lists computed. We proceed with the actual correction,
      # which IS dependent on the '--dontextract' option:
@@ -434,7 +484,7 @@
          # Correct links to files in the local collection.
          if options['verbose']=='--verbose' and options['verbositylevel']=='low':
              print 'Correcting links in the HTML files...'
--
++
          if options['verbose']=='--verbose' and options['verbositylevel']=='high':
              print '############### 1st pass ###############'
          for match_string in  match_strings:
@@ -443,7 +493,7 @@
                  print "match " + match_string + ' ' + "and replace it with " + replace_string
          if options['verbose']=='--verbose' and options['verbositylevel']=='high':
              print
--
++
          if options['verbose']=='--verbose' and options['verbositylevel']=='high':
              print '############### 2nd pass ###############'
          for match_string in  replace_garbled_strings:
@@ -452,38 +502,51 @@
                  print "match " + match_string + ' ' + "and replace it with " + replace_string
          if options['verbose']=='--verbose' and options['verbositylevel']=='high':
              print
--
++
          for filename in htmlout_filenames:
--
++
              pf=open(filename,'rU')
              page=pf.read()
              pf.close()
--
++
++            # Some names contain a '%20' (an HTML code for a space). We substitute with a "real space"
++            # otherwise we won't be able to match to the real files.
++            page = fix_spaces_in_links(page)
++
              # Substitutions in 1st pass: we replace the original filenames with their corresponding "garbled" equivalents.
++            # added (?i) modifier to make a case insensitive match for not breaking working links on windows in CHM files
++            # added " to the match criteria to avoid wrong match (eg this.htm matched also do_this.htm before)
              for match_string in  match_strings:
                  replace_string = replace_garbled_strings[match_strings.index(match_string)]
--                page = re.sub(match_string, replace_string, page)
--
--
++                page = re.sub('(?i)"'+match_string, '"'+replace_string, page)
++                #remove also path before..
++                #eg: ..\other path\this.htm should match "this.htm" but not "do_this.htm"
++                #this should match "matchstring or "some path\matchstring
++                page = re.sub(r'(?i)("|"[^\/"].*?\/)'+match_string, '"'+replace_string, page)
++                #what if in different paths we have files with the same name?
++
              # Substitutuions in the 2nd pass: we replace the garbled filenames with the correct ones.
              for match_string in  replace_garbled_strings:
                  replace_string = replace_strings[replace_garbled_strings.index(match_string)]
                  page = re.sub(match_string, replace_string, page)
--
--            # Replace links of the form "somefile.html#894" with "somefile0206.html"
++
++            # Replace links of the form "somefile.html#894" with "somefile0206.html"
              # The following will match anchors like '<a href="temp0206.html#894"' and will store the 'temp0206.html' in backreference 1.
              # The replace string will then replace it with '<a href="temp0206.html"', i.e. it will take away the '#894' part.
--            # This is because the numbers after the '#' are often wrong or non-existent. It is better to link to an existing
++            # This is because the numbers after the '#' are often wrong or non-existent. It is better to link to an existing
              # chapter than to a non-existent part of an existing chapter.
--            page = re.sub('<a href="([^#]*)#[^"]*"', '<a href="\\1"', page)
--
++            # page = re.sub('(?i)<a href="([^#|"]*)#[^"]*"', '<a href="\\1"', page)
++            # This leaves internal page links of the form "#.." intact
++            page = re.sub('(?i)<a href="([^#|"]+)#[^"]*"', '<a href="\\1"', page)
++
              pf=open(filename,'w')
              pf.write(page)
--            pf.close
++            pf.flush()
++            pf.close()
      # Here ends the extraction and correction of the HTML files which, as said above,
      # will take place ONLY IF '--dontextract' was NOT given.
--    # If '--dontextract' was given, only the file lists like htmlout_filename_list
++    # If '--dontextract' was given, only the file lists like htmlout_filename_list
      # were computed above, but no file extraction or correction took place.
+     #
      # ########################### File extraction and correction: END   ############################
@@ -588,18 +651,19 @@
              elif key=='user-password': htmldoc_opts += ' --user-password ' + value
              elif key=='version': htmldoc_opts += ' ' + value
              elif key=='webpage': htmldoc_opts += ' ' + value
--
++
++    cmd = 'htmldoc' + htmldoc_opts + ' ' + htmlout_filename_list + " -f "+ quote(outputfilename) + " > /dev/null"
      if options['verbose']=='--verbose' and options['verbositylevel']=='high':
--        print 'htmldoc' + htmldoc_opts + ' ' + htmlout_filename_list + " -f "+ outputfilename + " > /dev/null"
--    exit_value=os.system ('htmldoc' + htmldoc_opts + ' ' + htmlout_filename_list + " -f "+ outputfilename + " > /dev/null")
++        print cmd
++    exit_value=os.system (cmd)
      if exit_value != 0:
          print 'Something wrong happened when launching htmldoc.'
          print 'exit value: ',exit_value
          print 'Check if output exists or if it is good.'
--    else:
++    else:
          print 'Written file ' + outputfilename
--    print 'Done.'
++    print 'Done.'
  def usage (name):
      print 'Usage:'
@@ -832,7 +896,7 @@
      options['webpage'] = ''
      try:
--        opts, args = getopt.getopt(sys.argv[1:], "f:t:v:",
++        opts, args = getopt.getopt(sys.argv[1:], "f:t:v:",
+                      [
                        "beautifulsoup",
                        "bodycolor=",
@@ -930,7 +994,7 @@
      except getopt.GetoptError:
          usage(sys.argv[0])
          sys.exit(1)
--
++
      for o, a in opts:
          if   o == '--beautifulsoup': options['beautifulsoup'] = '--beautifulsoup'
          elif o == '--bodycolor': options['bodycolor'] = a
@@ -1022,7 +1086,7 @@
          elif o == '--user-password': options['user-password'] = a
          elif o in ('-v', '--verbose'): options['verbose'] = '--verbose'
          elif o == '--verbositylevel': options['verbositylevel'] = a
--        elif o == '--version':
++        elif o == '--version':
              print sys.argv[0] + ' version ' + version
              print 'This is free software; see the source for copying conditions.  There is NO'
              print 'warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.'
@@ -1041,7 +1105,7 @@
          return
+     #
      # One of '--book' or '--webpage' MUST be given!
--    if options['extract-only'] == '' and ((options['book'] == '' and options['webpage'] == '' and options['continuous'] == '') or
++    if options['extract-only'] == '' and ((options['book'] == '' and options['webpage'] == '' and options['continuous'] == '') or
                                            (options['book'] == '--book' and options['webpage'] == '--webpage') or
                                            (options['book'] == '--book' and options['continuous'] == '--continuous') or
                                            (options['webpage'] == '--webpage' and options['continuous'] == '--continuous')):
@@ -1057,7 +1121,7 @@
          return
      elif len(args)==1:
          filename = args[0]
--        dirname, basename, suffix = split(filename)
++        dirname, basename, suffix = split(filename)
          if dirname:
              outputfilename = dirname + os.sep + basename +'.pdf'
          else:
@@ -1072,7 +1136,7 @@
      else:
          usage(sys.argv[0])
          return
--
++
      CHM2PDF_WORK_DIR = CHM2PDF_TEMP_WORK_DIR + os.sep + basename
      CHM2PDF_ORIG_DIR = CHM2PDF_TEMP_ORIG_DIR + os.sep + basename
@@ -1083,14 +1147,16 @@
      if not os.path.exists(filename):
          print 'CHM file "' + filename + '" not found!'
          return
--
++
      #remove temporary files
      if options['dontextract'] == '':
          if options['verbose']=='--verbose' and options['verbositylevel']=='high':
              print 'Removing any previous temporary files...'
--        os.system('rm -r '+CHM2PDF_ORIG_DIR+'/*')
--        os.system('rm -r '+CHM2PDF_WORK_DIR+'/*')
--
++        try:
++            os.rmdir(CHM2PDF_ORIG_DIR)
++            os.rmdir(CHM2PDF_WORK_DIR)
++        except OSError:
++            pass
      cfile = chm.CHMFile()
      cfile.LoadCHM(filename)
@@ -1100,13 +1166,13 @@
              print 'Will use the files in ' + CHM2PDF_ORIG_DIR + ' and ' + CHM2PDF_WORK_DIR + '.'
      else:
          if options['verbose'] == '--verbose' and options['verbositylevel'] == 'high':
--            os.system('extract_chmLib ' + filename + ' ' + CHM2PDF_ORIG_DIR)
++            os.system('extract_chmLib ' + quote(filename) + ' ' + quote(CHM2PDF_ORIG_DIR))
          else:
--            os.system('extract_chmLib ' + filename + ' ' + CHM2PDF_ORIG_DIR + '&> /dev/null')
--
++            os.system('extract_chmLib ' + quote(filename) + ' ' + quote(CHM2PDF_ORIG_DIR) + '&> /dev/null')
++
      convert_to_pdf(cfile, filename, outputfilename, options)
  if __name__ == '__main__':
      main(sys.argv)
--
++
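
Note on the temporary-file hunk above (@@ -1083,14 +1147,16 @@): os.rmdir() only removes an empty directory and raises OSError otherwise. With the try/except around it, files left over from a previous run therefore stay in place, whereas the old 'rm -r' lines deleted them. A minimal sketch of one alternative, assuming it is acceptable to delete each directory and recreate it empty; reset_dir is a hypothetical name, not something that exists in chm2pdf:

```python
import errno
import os
import shutil


def reset_dir(path):
    """Remove `path` and everything below it, then recreate it empty.

    Hypothetical helper for illustration; not part of chm2pdf.
    """
    try:
        shutil.rmtree(path)
    except OSError as exc:
        # A directory that does not exist yet is fine; re-raise anything else.
        if exc.errno != errno.ENOENT:
            raise
    # makedirs also creates missing parent directories.
    os.makedirs(path)
```

Under that assumption, calling reset_dir(CHM2PDF_ORIG_DIR) and reset_dir(CHM2PDF_WORK_DIR) inside the "dontextract" check would clear old files without going through a shell.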
 === modified file 'debian/README.source' (properties changed: -x to +x)
 === modified file 'debian/changelog' (properties changed: -x to +x)
 === modified file 'debian/chm2pdf.1' (properties changed: -x to +x)
 === modified file 'debian/chm2pdf.manpages' (properties changed: -x to +x)
 === modified file 'debian/compat' (properties changed: -x to +x)
 === modified file 'debian/control' (properties changed: -x to +x)
 === modified file 'debian/copyright' (properties changed: -x to +x)
 === modified file 'debian/patches/multi_filename_fix.diff' (properties changed: -x to +x)
 === modified file 'debian/patches/series' (properties changed: -x to +x)
 === modified file 'debian/pycompat' (properties changed: -x to +x)
 === modified file 'debian/pyversions' (properties changed: -x to +x)
 === modified file 'debian/watch' (properties changed: -x to +x)
 === modified file 'setup.py' (properties changed: -x to +x)
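
Note on the quote() changes in the chm2pdf diff: the commands are still assembled as one string and handed to os.system(), so every filename has to be quoted correctly for /bin/sh. Also, '&> /dev/null' is a bash extension; a plain POSIX sh such as dash reads it as '&' (run in background) followed by a separate redirection. A minimal sketch of a shell-free alternative, assuming the command can be expressed as an argument list; run_quietly is a hypothetical name, not part of chm2pdf:

```python
import os
import subprocess


def run_quietly(argv, verbose=False):
    """Run `argv` (a list of strings) without a shell and return its exit status.

    Hypothetical helper for illustration; not part of chm2pdf.
    When verbose is False, stdout and stderr are discarded.
    """
    if verbose:
        return subprocess.call(argv)
    devnull = open(os.devnull, 'w')
    try:
        # Each list element reaches the program as one argument,
        # so spaces or parentheses in filenames need no quoting.
        return subprocess.call(argv, stdout=devnull, stderr=subprocess.STDOUT)
    finally:
        devnull.close()
```

With this, the extraction step could be written as run_quietly(['extract_chmLib', filename, CHM2PDF_ORIG_DIR], verbose=...). The htmldoc call would need htmldoc_opts built as a list instead of a string, which is a larger change than this diff makes.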
