The module works pretty well. I was able to extract text and generate images for several PDFs without any problems.
There are some minor things that should be discussed and/or fixed:
- the error seems to be too general: essentially it always raises JAVA-EXCEPTION no matter what goes wrong (e.g. if the given input is not a valid PDF)
- the Java stack trace seems to be sent to standard error
- typo in the documentation: "Renders the each page of the PDF document as an image." => "Renders each page of the PDF document as an image."
- the names of the private functions should also adhere to the code conventions, e.g. renderToImages => render-to-images
- make xqdoc fails because the comments contain invalid XML (the second note opens with <b>Note:<b> instead of <b>Note:</b>):
</home/mbrantner/zorba/build/URI_PATH/com/zorba-xquery/www/modules/project_xqdoc.xq>:142,9: user-defined error [err:UE004]: Error processing module zerr:ZXQD0002 - " This module provides funtionality to read the text from PDF documents and
to render PDF documents to images.
<a href="http://pdfbox.apache.org">Apache PDFBox</a> library is used to
implement these functions.
<br />
<br />
<b>Note:</b> Since this module has a Java library dependency a JVM required
to be installed on the system. For Windows: jvm.dll is required on the system
path ( usually located in "C:\Program Files\Java\jre6\bin\client".
<b>Note:<b> For Debian based Linux distributions install PdfBox and FontBox
packages: sudo apt-get install libpdfbox-java libfontbox-java
": can not parse as XML for xqdoc: loader parsing error: Opening and ending tag mismatch: b line 0 and root
; raised at /home/mbrantner/zorba/sandbox/src/runtime/errors_and_diagnostics/errors_and_diagnostics_impl.cpp:81
- adapt the year in "Copyright 2006-2009 The FLWOR Foundation." in the .xq file (and some other files also)
- would it make sense to return one string per page of the PDF instead of one big string?
- remove commented out code in read-pdf.cpp
- valgrind shows tons of invalid writes. Why? Are they critical? Is there anything we can do?
- would it make sense to return the images in a streaming fashion (i.e. don't create all base64's in a vector)?
- encoding each image shouldn't be necessary and will probably be wasted effort because the images might be written to a file in their binary form
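
Regarding the xqdoc failure above: the mismatch can be reproduced outside of Zorba with a quick well-formedness check. The sketch below is only an illustration, not what xqdoc does internally. The helper name is_well_formed is made up, and wrapping the comment in a <root> element is an assumption based on the "b line 0 and root" part of the error message.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Wrap the comment text in a root element and try to parse it as XML.

    The <root> wrapper is an assumption; it only mimics the kind of check
    that the xqdoc error message hints at.
    """
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


# The note as it currently appears in the module comment (second <b> is never closed).
broken = "<b>Note:<b> For Debian based Linux distributions install PdfBox and FontBox packages"
# The same note with the closing tag fixed.
fixed = "<b>Note:</b> For Debian based Linux distributions install PdfBox and FontBox packages"

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

Running the whole module comment through a check like this before committing should catch this class of problem early.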