lp:~mgedmin/+junk/pdf2html

Created by Marius Gedminas

Python wrapper around pdftohtml (from poppler-utils) that tries hard to preserve paragraphs.

Get this branch:
bzr branch lp:~mgedmin/+junk/pdf2html

Branch information

Owner: Marius Gedminas
Status: Mature

Recent revisions

50. By Marius Gedminas

Suppress some flake8 warnings

49. By Marius Gedminas

New option: --encoding

Fix extra spaces after superscript.

And I can't easily untangle these two commits because bzr is not git.

48. By Marius Gedminas

Better superscript handling logic.

Doesn't mistakenly join footnotes with text on the next page.

47. By Marius Gedminas

Handle superscript!

Also handle hyphenated Lithuanian words.

46. By Marius Gedminas

Make it possible to override guessed horiz position, indent, leeway.

45. By Marius Gedminas

Show topmost and bottommost coordinates with --debug.

Helps the user discover the value of --header-pos or --footer-pos.

44. By Marius Gedminas

New option: --leading

43. By Marius Gedminas

Found a book that needs this leeway, I think.

42. By Marius Gedminas

Pass -hidden and -nodrm to pdftohtml.

The -hidden option lets you see copy-pasteable text when the PDF itself is
just a bunch of bitmaps.

41. By Marius Gedminas

Do not crash if the PDF doesn't have any actual text in it.
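
Revision 42 above mentions passing -hidden and -nodrm to pdftohtml. As a rough illustration only (this is not the branch's actual code), a wrapper might assemble the command along these lines. The function names, the default encoding, and the choice of -xml, -i, -stdout and -enc are assumptions made for this sketch; only -hidden and -nodrm come from the revision log.

```python
import subprocess


def build_pdftohtml_command(pdf_path, encoding="UTF-8"):
    """Return an argv list for pdftohtml (hypothetical helper)."""
    return [
        "pdftohtml",
        "-xml",           # assumed: XML output with text coordinates
        "-i",             # assumed: ignore images
        "-stdout",        # assumed: write the result to standard output
        "-hidden",        # from revision 42: include hidden text
        "-nodrm",         # from revision 42: override DRM settings
        "-enc", encoding,  # assumed mapping for an encoding option
        pdf_path,
    ]


def run_pdftohtml(pdf_path, encoding="UTF-8"):
    """Run pdftohtml and return its raw output as bytes."""
    result = subprocess.run(
        build_pdftohtml_command(pdf_path, encoding),
        check=True,              # raise if pdftohtml exits with an error
        stdout=subprocess.PIPE,  # capture the converted document
    )
    return result.stdout
```

Keeping the command construction separate from the subprocess call makes the flag list easy to inspect without having poppler-utils installed.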

Branch metadata

Branch format: Branch format 6
Repository format: Bazaar pack repository format 1 with rich root (needs bzr 1.0)