lp:~mgedmin/+junk/pdf2html

Created by Marius Gedminas on 2009-08-28 and last modified on 2013-07-18

Python wrapper around pdftohtml (from poppler-utils) that tries hard to preserve paragraphs.

Get this branch:
bzr branch lp:~mgedmin/+junk/pdf2html

Branch information

Owner: Marius Gedminas
Status: Mature

Recent revisions

50. By Marius Gedminas on 2013-07-18

Suppress some flake8 warnings

49. By Marius Gedminas on 2013-07-18

New option: --encoding

Fix extra spaces after superscript.

And I can't easily untangle these two commits because bzr is not git.

48. By Marius Gedminas on 2013-07-10

Better superscript handling logic.

Doesn't mistakenly join footnotes with text on the next page.

47. By Marius Gedminas on 2013-07-10

Handle superscript!

Also handle hyphenated Lithuanian words.

46. By Marius Gedminas on 2013-07-10

Make it possible to override guessed horiz position, indent, leeway.

45. By Marius Gedminas on 2013-06-05

Show topmost and bottommost coordinates with --debug.

Helps the user discover the value of --header-pos or --footer-pos.

44. By Marius Gedminas on 2013-06-03

New option: --leading

43. By Marius Gedminas on 2012-11-17

Found a book that needs this leeway, I think.

42. By Marius Gedminas on 2012-05-02

Pass -hidden and -nodrm to pdftohtml.

The -hidden option lets you see copy-pasteable text when the PDF itself is
just a bunch of bitmaps.

41. By Marius Gedminas on 2012-05-02

Do not crash if the PDF doesn't have any actual text in it.
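
Revision 42 above mentions passing -hidden and -nodrm to pdftohtml. The following is a minimal sketch of how a Python wrapper might assemble such a call; it is not taken from this branch. The helper names build_pdftohtml_command and run_pdftohtml, and their parameters, are hypothetical. Only the pdftohtml flags -hidden and -nodrm come from the revision log.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, output_base=None, hidden=True, nodrm=True):
    """Return the argv list for a pdftohtml call (hypothetical helper).

    hidden -- pass -hidden so text hidden behind page bitmaps is extracted
    nodrm  -- pass -nodrm so the document's DRM settings are overridden
    """
    command = ["pdftohtml"]
    if hidden:
        command.append("-hidden")
    if nodrm:
        command.append("-nodrm")
    command.append(pdf_path)
    if output_base is not None:
        # Optional base name for the generated output files.
        command.append(output_base)
    return command


def run_pdftohtml(pdf_path, output_base=None):
    """Run pdftohtml, failing early if poppler-utils is not installed."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml (poppler-utils) is not installed")
    subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=True)
```

Building the argument list in a separate function keeps it testable without poppler-utils installed; only run_pdftohtml touches the external program.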

Branch metadata

Branch format: Branch format 6
Repository format: Bazaar pack repository format 1 with rich root (needs bzr 1.0)