Created by Marius Gedminas on 2009-08-28 and last modified on 2013-07-18

Python wrapper around pdftohtml (from poppler-utils) that tries hard to preserve paragraphs.

Get this branch:
bzr branch lp:~mgedmin/+junk/pdf2html

Branch information

Marius Gedminas

Recent revisions

50. By Marius Gedminas on 2013-07-18

Suppress some flake8 warnings

49. By Marius Gedminas on 2013-07-18

New option: --encoding

Fix extra spaces after superscript.

And I can't easily untangle these two commits because bzr is not git.

48. By Marius Gedminas on 2013-07-10

Better superscript handling logic.

Doesn't mistakenly join footnotes with text on the next page.

47. By Marius Gedminas on 2013-07-10

Handle superscript!

Also handle hyphenated Lithuanian words.
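
To illustrate what joining hyphenated words can look like, here is a minimal sketch. It is not taken from the branch: the function name `join_hyphenated` and the rule it applies (merge when a line ends with a hyphen and the next line starts with a lowercase letter) are assumptions made for this example, and the real script may use different heuristics.

```python
def join_hyphenated(lines):
    """Merge a word split by a trailing hyphen with its continuation line.

    Hypothetical helper, not the branch's actual code.
    """
    result = []
    for line in lines:
        line = line.strip()
        if result and result[-1].endswith('-') and line[:1].islower():
            # Drop the trailing hyphen and glue the continuation on.
            result[-1] = result[-1][:-1] + line
        else:
            result.append(line)
    return result


print(join_hyphenated(["Tai yra pavyz-", "dys teksto.", "Kita eilutė."]))
```

A rule this simple would also remove the hyphen from a genuine compound that happens to break at its hyphen, so a real implementation probably needs extra checks.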

46. By Marius Gedminas on 2013-07-10

Make it possible to override guessed horiz position, indent, leeway.

45. By Marius Gedminas on 2013-06-05

Show topmost and bottommost coordinates with --debug.

Helps the user discover the value of --header-pos or --footer-pos.
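
As a rough sketch of how such coordinates could be derived, the example below scans the `<text>` elements that `pdftohtml -xml` emits and reports the smallest `top` and the largest `top + height`. The function name `vertical_extent`, the hand-written sample XML, and the choice to treat `top + height` as the bottom edge are all assumptions for illustration. What --debug actually prints may differ.

```python
import xml.etree.ElementTree as ET


def vertical_extent(xml_source):
    """Return (topmost, bottommost) over all <text> elements, or None if none exist.

    Hypothetical helper, not the branch's actual code.
    """
    root = ET.fromstring(xml_source)
    tops = []
    bottoms = []
    for node in root.iter('text'):
        top = int(node.get('top'))
        height = int(node.get('height'))
        tops.append(top)
        bottoms.append(top + height)
    if not tops:
        # A PDF made only of images yields no <text> elements at all.
        return None
    return min(tops), max(bottoms)


# Simplified, hand-written sample in the shape of pdftohtml -xml output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="40" left="100" width="200" height="12" font="0">Running header</text>
<text top="120" left="100" width="700" height="16" font="1">Body text.</text>
<text top="1130" left="450" width="20" height="12" font="0">1</text>
</page>
</pdf2xml>"""

print(vertical_extent(SAMPLE))  # (40, 1142)
```

In this sample, a header position just below 40 and a footer position just above 1130 would separate the running header and the page number from the body text.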

44. By Marius Gedminas on 2013-06-03

New option: --leading

43. By Marius Gedminas on 2012-11-17

Found a book that needs this leeway, I think.

42. By Marius Gedminas on 2012-05-02

Pass -hidden and -nodrm to pdftohtml.

The -hidden option lets you see copy-pasteable text when the PDF itself is
just a bunch of bitmaps.
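
A minimal sketch of passing these options from Python might look like the following. Only `-hidden` and `-nodrm` come from the commit message. The `-xml` and `-stdout` flags are real pdftohtml options, but whether the wrapper uses them is an assumption, as are the function names `pdftohtml_command` and `run_pdftohtml`.

```python
import subprocess


def pdftohtml_command(pdf_path):
    """Build the argument list for pdftohtml (hypothetical helper)."""
    return [
        'pdftohtml',
        '-xml',      # assumed: emit XML rather than HTML
        '-stdout',   # assumed: write the result to standard output
        '-hidden',   # include hidden text, e.g. an OCR layer behind page bitmaps
        '-nodrm',    # override the document's DRM settings
        pdf_path,
    ]


def run_pdftohtml(pdf_path):
    """Run pdftohtml and return its raw output; requires poppler-utils."""
    completed = subprocess.run(
        pdftohtml_command(pdf_path),
        stdout=subprocess.PIPE,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.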

41. By Marius Gedminas on 2012-05-02

Do not crash if the PDF doesn't have any actual text in it.

Branch metadata

Branch format: Branch format 6
Repository format: Bazaar pack repository format 1 with rich root (needs bzr 1.0)
This branch contains public information that everyone can see.