Merge lp:~zorba-coders/zorba/fread-pdf-trunk into lp:zorba

Proposed by Cezar Andrei on 2012-09-19
Status: Superseded
Proposed branch: lp:~zorba-coders/zorba/fread-pdf-trunk
Merge into: lp:zorba
Diff against target: 175 lines (+41/-12)
7 files modified
ChangeLog (+1/-0)
cmake_modules/ZorbaModule.cmake (+26/-1)
include/zorba/item_factory.h (+7/-7)
modules/ExternalModules.conf (+2/-1)
src/store/naive/atomic_items.h (+2/-2)
test/rbkt/modules/CMakeLists.txt (+2/-1)
test/rbkt/modules/ext2_config.txt.in (+1/-0)
To merge this branch: bzr merge lp:~zorba-coders/zorba/fread-pdf-trunk
Reviewer Review Type Date Requested Status
Matthias Brantner 2012-09-19 Needs Fixing on 2012-09-21
Chris Hillery 2012-09-19 Approve on 2012-09-19
Cezar Andrei Needs Fixing on 2012-09-19
Review via email: mp+125338@code.launchpad.net

This proposal has been superseded by a proposal from 2012-09-22.

Commit message

Make doc comments for createBase64Binary more explicit on what parameters they expect and what they do.
Change return value to xs_int for getIntValue() method.

Description of the change

Make doc comments for createBaser64Binary more explicit on what parameters they expect and what they do.
Change, return value to xs_int for getIntValue() method.

To post a comment you must log in.
Chris Hillery (ceejatec) wrote :

You changed the URIs of the test modules back to www.zorba-xquery.com, instead of zorba-tests.28msec.us. That will cause test failures. Make sure you've merged from the trunk recently.

review: Needs Fixing
Chris Hillery (ceejatec) wrote :

(Other than that the changes look fine)

Cezar Andrei (cezar-andrei) wrote :

It's up to date now.

review: Approve
Chris Hillery (ceejatec) wrote :

You still haven't corrected the URIs of the test modules, though (test/rbkt/modules/CMakeLists.txt).

Cezar Andrei (cezar-andrei) wrote :

Let me add the new module.

review: Needs Fixing
Cezar Andrei (cezar-andrei) wrote :

The ref to the new module is in, please review it too: lp:zorba/read-pdf-module at
https://code.launchpad.net/~zorba-coders/zorba/fread-pdf-the-module .

Cezar Andrei (cezar-andrei) wrote :

Back to zorba-tests.28msec.us in test/rbkt/modules/CMakeLists.txt and mention in Changelog.

Chris Hillery (ceejatec) wrote :

Great, thanks.

review: Approve
Zorba Build Bot (zorba-buildbot) wrote :

Attempt to merge into lp:zorba failed due to conflicts:

text conflict in ChangeLog

It looks nice (without being able to testing it ;-)

- The module should follow XQuery coding conventions. For example, no camel-case
but function and variable names using dashes.

- The read-pdf:renderToImages-internal function is not documented. This will probably cause
the xqdoc test to fail.

- There is a conflict in the ChangeLog.

- The schema contains some commented out code => remove?

- The declaration of the $options variable for the -internal functions should not allow for
an optional occurrence indicator.

review: Needs Fixing
Cezar Andrei (cezar-andrei) wrote :

I fixed all the comments with the exception of the optional occurrence indicator for internal functions. I tested it with empty sequence and works fine.

Zorba Build Bot (zorba-buildbot) wrote :

Validation queue job fread-pdf-trunk-2012-09-20T03-03-43.261Z is finished. The final status was:

All tests succeeded!

Zorba Build Bot (zorba-buildbot) wrote :

Voting does not meet specified criteria. Required: Approve > 1, Disapprove < 1, Needs Fixing < 1, Pending < 1. Got: 1 Approve, 2 Needs Fixing.

10954. By Cezar Andrei <email address hidden> on 2012-09-21

Depend on the latest version of util-jvm module.

Paul J. Lucas (paul-lucas) wrote :

It doesn't look like the fpdf.* files were added via "bzr add" -- were they supposed to be?

Cezar Andrei (cezar-andrei) wrote :

Paul,

If you're asking about my last commit msg, it is as it should be just one line change, by deleting the tag name, it makes trunk use the latest HEAD version of the module branch. The files in util-jvm were already reviewed and merged see this merge: https://code.launchpad.net/~zorba-coders/zorba/fread-pdf-utiljvm-module/+merge/125332 .

The changes in file: modules/ExternalModules.conf'

117 --- modules/ExternalModules.conf 2012-09-17 00:36:37 +0000
118 +++ modules/ExternalModules.conf 2012-09-21 15:25:26 +0000
119 @@ -39,11 +39,12 @@
123 +read-pdf bzr lp:zorba/read-pdf-module
128 -util-jvm bzr lp:zorba/util-jvm-module zorba-2.6
129 +util-jvm bzr lp:zorba/util-jvm-module

The module works pretty decent. I was able to extract text or generate images for several pdfs without any problems.
There are some minor things that should be discussed and/or fixed:

- the error seems to be too general, essentially it always raises JAVA-EXCEPTION no matter what goes wrong (e.g. it the given input is not a valid pdf)

- the java stack trace seems to be sent to standard error

- Renders the each page of the PDF document as an image. => Renders each page of the PDF document as an image.

- the names of the private functions should also adhere to the code conventions renderToImages => render-to-images

- make xqdoc failes because the comments seem to contain invalid xml
</home/mbrantner/zorba/build/URI_PATH/com/zorba-xquery/www/modules/project_xqdoc.xq>:142,9: user-defined error [err:UE004]: Error processing module zerr:ZXQD0002 - " This module provides funtionality to read the text from PDF documents and
 to render PDF documents to images.
 <a href="http://pdfbox.apache.org">Apache PDFBox</a> library is used to
 implement these functions.
 <br />
 <br />
 <b>Note:</b> Since this module has a Java library dependency a JVM required
 to be installed on the system. For Windows: jvm.dll is required on the system
 path ( usually located in "C:\Program Files\Java\jre6\bin\client".
 <b>Note:<b> For Debian based Linux distributions install PdfBox and FontBox
 packages: sudo apt-get install libpdfbox-java libfontbox-java
": can not parse as XML for xqdoc: loader parsing error: Opening and ending tag mismatch: b line 0 and root
; raised at /home/mbrantner/zorba/sandbox/src/runtime/errors_and_diagnostics/errors_and_diagnostics_impl.cpp:81

- adapt the year in "Copyright 2006-2009 The FLWOR Foundation." in the .xq file (and some other files also)

- would it make sense to return one string per page in the pdf instead of one big string?

- remove commented out code in read-pdf.cpp

- valgrind shows tons of invalid writes. Why? Are they critical? Is there anything we can do?

- would it make sense to return the images in a streaming fashion (i.e. don't create all base64's in a vector)?

- encoding each image shouldn't be necessary and will probably we wasted effort because the images might be written to a file in their binary form

review: Needs Fixing
Cezar Andrei (cezar-andrei) wrote :

See answers inline:
- the error seems to be too general, essentially it always raises JAVA-EXCEPTION no matter what goes wrong (e.g. it the given input is not a valid pdf)
I adapted the error msg to be more clear/specific.

- the java stack trace seems to be sent to standard error
Goes to std err.

- Renders the each page of the PDF document as an image. => Renders each page of the PDF document as an image.
Done.

- the names of the private functions should also adhere to the code conventions renderToImages => render-to-images
Done.

- make xqdoc failes because the comments seem to contain invalid xml
</home/mbrantner/zorba/build/URI_PATH/com/zorba-xquery/www/modules/project_xqdoc.xq>:142,9: user-defined error [err:UE004]: Error processing module zerr:ZXQD0002 - " This module provides funtionality to read the text from PDF documents and
 to render PDF documents to images.
 <a href="http://pdfbox.apache.org">Apache PDFBox</a> library is used to
 implement these functions.
 <br />
 <br />
 <b>Note:</b> Since this module has a Java library dependency a JVM required
 to be installed on the system. For Windows: jvm.dll is required on the system
 path ( usually located in "C:\Program Files\Java\jre6\bin\client".
 <b>Note:<b> For Debian based Linux distributions install PdfBox and FontBox
 packages: sudo apt-get install libpdfbox-java libfontbox-java
": can not parse as XML for xqdoc: loader parsing error: Opening and ending tag mismatch: b line 0 and root
; raised at /home/mbrantner/zorba/sandbox/src/runtime/errors_and_diagnostics/errors_and_diagnostics_impl.cpp:81
Done.

- adapt the year in "Copyright 2006-2009 The FLWOR Foundation." in the .xq file (and some other files also)
Done.

- would it make sense to return one string per page in the pdf instead of one big string?
The API doesn't alow it, but I added two more optional options, to insert a user defined string at the start and end of each page.

- remove commented out code in read-pdf.cpp
Done.

- valgrind shows tons of invalid writes. Why? Are they critical? Is there anything we can do?
Jvm always shows in valgrind, even if nothing is done with it. I was careful to remove any allocated memory.

- would it make sense to return the images in a streaming fashion (i.e. don't create all base64's in a vector)?
No, because it's a push write of all images. And as discussed, optimize only a copy in some cases isn't worth the effort.

- encoding each image shouldn't be necessary and will probably we wasted effort because the images might be written to a file in their binary form
Done.

10955. By Cezar Andrei <email address hidden> on 2012-09-26

Merge from trunk, resolved ChangeLog conflict.

10956. By Cezar Andrei <email address hidden> on 2012-09-27

Merge from trunk.
Resolved changeLog conflict.

10957. By Cezar Andrei <email address hidden> on 2012-09-27

Merge from trunk. Resolved conflict in ExternalModules.conf.

10958. By Cezar Andrei <email address hidden> on 2012-09-27

Add tag to read-pdf module.

10959. By Cezar Andrei <email address hidden> on 2012-10-02

Merge from trunk, resolve Changelog conflict.

Unmerged revisions

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'ChangeLog'
2--- ChangeLog 2012-09-20 17:43:35 +0000
3+++ ChangeLog 2012-09-21 15:25:26 +0000
4@@ -15,6 +15,7 @@
5 * Implemented semantics of null for comparison and arithmetics operations.
6 * Positional pagination support for index probes
7 * Recognize the no-copy pragma to avoid copying nodes before insertion into a collection.
8+ * Adding new external module read-pdf, it converts PDF documents to text or rendered images.
9
10 Optimizations:
11 * New memory management for compiler expressions (no more ref counting)
12
13=== modified file 'cmake_modules/ZorbaModule.cmake'
14--- cmake_modules/ZorbaModule.cmake 2012-09-17 00:36:37 +0000
15+++ cmake_modules/ZorbaModule.cmake 2012-09-21 15:25:26 +0000
16@@ -114,9 +114,20 @@
17 # relative to CMAKE_CURRENT_SOURCE_DIR)
18 # LINK_LIBRARIES - (optional) List of libraries to link external
19 # function library against
20+# CONFIG_FILES - (optional) List of files to configure with package
21+# information; see below
22 # TEST_ONLY - (optional) Module is for testcases only and should not
23 # be installed
24 #
25+# CONFIG_FILES - any files specific here will be copied to
26+# CMAKE_CURRENT_BINARY_DIR using CONFIGURE_FILE(). They may contain
27+# the following @VARIABLES@ which will be substituted:
28+# ZORBA_MODULE_RELATIVE_DIR - directory portion of mangled URI
29+# ZORBA_MODULE_LIBFILE_WE - filename (without extension) portion of
30+# mangled URI
31+# The input files should have a .in extension. The resulting file in
32+# the build directory will have the .in removed.
33+#
34 # QQQ this currently doesn't support modules with multiple component
35 # .xq files. (Neither does Zorba's automatic loading mechanism, so
36 # this probably isn't a huge deal, but worth thinking about.)
37@@ -125,7 +136,7 @@
38 # file enough to deduce the URI and version?
39 MACRO (DECLARE_ZORBA_MODULE)
40 # Parse and validate arguments
41- PARSE_ARGUMENTS(MODULE "LINK_LIBRARIES;EXTRA_SOURCES"
42+ PARSE_ARGUMENTS(MODULE "LINK_LIBRARIES;EXTRA_SOURCES;CONFIG_FILES"
43 "URI;FILE;VERSION" "TEST_ONLY" ${ARGN})
44 IF (NOT MODULE_FILE)
45 MESSAGE (FATAL_ERROR "'FILE' argument is required for ZORBA_DECLARE_MODULE()")
46@@ -353,6 +364,20 @@
47 "${version_infix}" "" 1 "${MODULE_TEST_ONLY}")
48 ENDFOREACH (version_infix)
49
50+ # Configure any module-specified config files.
51+ SET (ZORBA_MODULE_RELATIVE_DIR ${module_path})
52+ SET (ZORBA_MODULE_LIBFILE_WE ${module_filewe})
53+ FOREACH (_config_file ${MODULE_CONFIG_FILES})
54+ # Strip off .in - can't use GET_FILENAME_COMPONENT as it always removes
55+ # the longest possible extension
56+ STRING (REGEX REPLACE "\\.in$" "" _config_filename_we "${_config_file}")
57+ IF (NOT IS_ABSOLUTE "${_config_file}")
58+ SET (_config_file "${CMAKE_CURRENT_SOURCE_DIR}/${_config_file}")
59+ ENDIF (NOT IS_ABSOLUTE "${_config_file}")
60+ CONFIGURE_FILE (${_config_file}
61+ "${CMAKE_CURRENT_BINARY_DIR}/${_config_filename_we}" @ONLY)
62+ ENDFOREACH (_config_file)
63+
64 # Last but not least, whip up a test case that ensures the module
65 # can at least be compiled. Don't bother for test-only modules
66 # (presumably they're there to be tested!).
67
68=== modified file 'include/zorba/item_factory.h'
69--- include/zorba/item_factory.h 2012-09-17 00:36:37 +0000
70+++ include/zorba/item_factory.h 2012-09-21 15:25:26 +0000
71@@ -123,8 +123,8 @@
72 /** \brief Creates a Base64Binary Item
73 * see [http://www.w3.org/TR/xmlschema-2/#base64Binary]
74 *
75- * @param aBinData a pointer to the base6c4 binary data.
76- * @param aLength the length of the base64 binary data.
77+ * @param aBinData a pointer to the base64 encoded data. The data is copied from aBinData.
78+ * @param aLength the length of the base64 encoded data.
79 * @return The Base64Binary Item.
80 */
81 virtual Item
82@@ -133,7 +133,7 @@
83 /** \brief Creates a Base64Binary Item
84 * see [http://www.w3.org/TR/xmlschema-2/#base64Binary]
85 *
86- * @param aStream A stream containing the Base64 encoded data.
87+ * @param aStream A stream containing the Base64 encoded data. The data is copied from aStream imediately.
88 * @return the Base64Binary Item.
89 */
90 virtual Item
91@@ -142,11 +142,11 @@
92 /** \brief Creates a Base64Binary Item
93 * see [http://www.w3.org/TR/xmlschema-2/#base64Binary]
94 *
95- * @param aBinData the data in binary form. The data is copied from aBinData.
96- * @param aLength the length of the data
97+ * @param aBinData the data in binary form (not encoded). The data is copied from aBinData.
98+ * @param aLength the length of the binary data
99 * @return the Base64Binary Item.
100 */
101- virtual Item
102+ virtual Item
103 createBase64Binary(const unsigned char* aBinData, size_t aLength) = 0;
104
105 /** \brief Creates a streamable Base64Binary Item
106@@ -735,7 +735,7 @@
107 * @param aNames A vector containing the name and value of each pair.
108 */
109 virtual Item createJSONObject(std::vector<std::pair<Item, Item> >& aNames) = 0;
110-
111+
112 /**
113 * Create a JSON Array containing the specified items.
114 *
115
116=== modified file 'modules/ExternalModules.conf'
117--- modules/ExternalModules.conf 2012-09-17 00:36:37 +0000
118+++ modules/ExternalModules.conf 2012-09-21 15:25:26 +0000
119@@ -39,11 +39,12 @@
120 languages bzr lp:zorba/languages-module zorba-2.6
121 oauth bzr lp:zorba/oauth-module zorba-2.6
122 process bzr lp:zorba/process-module zorba-2.6
123+read-pdf bzr lp:zorba/read-pdf-module
124 security bzr lp:zorba/security-module zorba-2.6
125 system bzr lp:zorba/system-module zorba-2.6
126 xqxq bzr lp:zorba/xqxq-module zorba-2.6
127 email bzr lp:zorba/email-module zorba-2.6
128-util-jvm bzr lp:zorba/util-jvm-module zorba-2.6
129+util-jvm bzr lp:zorba/util-jvm-module
130 schema-tools bzr lp:zorba/schema-tools-module zorba-2.6
131 stack bzr lp:zorba/stack-module zorba-2.6
132 queue bzr lp:zorba/queue-module zorba-2.6
133
134=== modified file 'src/store/naive/atomic_items.h'
135--- src/store/naive/atomic_items.h 2012-09-17 00:36:37 +0000
136+++ src/store/naive/atomic_items.h 2012-09-21 15:25:26 +0000
137@@ -1776,7 +1776,7 @@
138 friend class AtomicItem;
139
140 protected:
141- int32_t theValue;
142+ xs_int theValue;
143
144 protected:
145 IntItem(xs_int aValue) : theValue(aValue) {}
146@@ -1792,7 +1792,7 @@
147
148 xs_long getLongValue() const { return static_cast<xs_long>(theValue); }
149
150- int32_t getIntValue() const { return theValue; }
151+ xs_int getIntValue() const { return theValue; }
152
153 store::SchemaTypeCode getTypeCode() const { return store::XS_INT; }
154
155
156=== modified file 'test/rbkt/modules/CMakeLists.txt'
157--- test/rbkt/modules/CMakeLists.txt 2012-09-17 00:36:37 +0000
158+++ test/rbkt/modules/CMakeLists.txt 2012-09-21 15:25:26 +0000
159@@ -27,8 +27,9 @@
160 DECLARE_ZORBA_MODULE(URI "http://zorba-tests.28msec.us/modules/B" VERSION 1.0
161 FILE "${CMAKE_CURRENT_SOURCE_DIR}/module-B.xq" TEST_ONLY)
162
163+# This one also has a CONFIG_FILE
164 DECLARE_ZORBA_MODULE(URI "http://zorba-tests.28msec.us/modules/ext"
165- VERSION 2.0 FILE "ext2.xq" TEST_ONLY)
166+ VERSION 2.0 FILE "ext2.xq" TEST_ONLY CONFIG_FILES ext2_config.txt.in)
167 DECLARE_ZORBA_MODULE(URI "http://zorba-tests.28msec.us/modules/ext"
168 VERSION 1.0 FILE "ext.xq" TEST_ONLY)
169
170
171=== added file 'test/rbkt/modules/ext2_config.txt.in'
172--- test/rbkt/modules/ext2_config.txt.in 1970-01-01 00:00:00 +0000
173+++ test/rbkt/modules/ext2_config.txt.in 2012-09-21 15:25:26 +0000
174@@ -0,0 +1,1 @@
175+This is a test: "@ZORBA_MODULE_RELATIVE_DIR@" "@ZORBA_MODULE_LIBFILE_WE@"

Subscribers

People subscribed via source and target branches