Merge lp:~zorba-coders/zorba/bug1123835 into lp:zorba

Proposed by Juan Zacarias
Status: Rejected
Rejected by: Chris Hillery
Proposed branch: lp:~zorba-coders/zorba/bug1123835
Merge into: lp:zorba
Diff against target: 687 lines (+550/-17)
4 files modified
src/runtime/CMakeLists.txt (+1/-0)
src/runtime/sequences/sequences_impl.cpp (+192/-17)
src/runtime/sequences/unparsed_streambuf.cpp (+226/-0)
src/runtime/sequences/unparsed_streambuf.h (+131/-0)
To merge this branch: bzr merge lp:~zorba-coders/zorba/bug1123835
Reviewer: Chris Hillery
Review status: Needs Fixing
Review via email: mp+158978@code.launchpad.net

Commit message

Fixes for FOTS errors in fn:unparsed-text* functions

Description of the change

Fixes for FOTS errors in fn:unparsed-text* functions

Juan Zacarias (juan457) wrote :

This branch doesn't solve all the errors yet. Below is the list of errors that are still unresolved, with a brief description of the current problem for each.

The remaining errors are caused by 3 problems:
* utf-8 encoding: we are still missing the stream, suggested to Paul, that can handle errors when invalid utf-8 is found.
* unknown encoding: the function should be able to identify whether the text has an unknown encoding. This error cannot be addressed at the moment, but it may become possible once the request for the previous issue is available.
* unparsed-text utf-8 BOM issue: I sent an email proposing 2 solutions for this problem, though probably not very pretty ones. The discussion is named "utf-8 byte order marks"; take a look at the email reply chain for more information on this issue.

Apart from this list, a bunch of other errors were solved. I suggest we merge these fixes but keep the branch open until a utf-8 stream that handles invalid characters is implemented or a new suggestion is made.
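To make the first problem concrete, here is a minimal sketch of the kind of check such a stream would need to apply to the bytes it reads. It is a standalone illustration and not the Zorba utf8 code: the name is_valid_utf8 is hypothetical, and a real stream buffer would have to carry its state across read calls instead of seeing the whole buffer at once.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true if s[0..len) is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// UTF-16 surrogates and code points above U+10FFFF.
inline bool is_valid_utf8(const unsigned char* s, std::size_t len) {
  assert(s != nullptr || len == 0);
  std::size_t i = 0;
  while (i < len) {
    const unsigned char c = s[i];
    if (c < 0x80) { ++i; continue; }          // plain ASCII byte

    std::size_t n;          // number of continuation bytes expected
    unsigned long cp;       // code point being assembled
    unsigned long lowest;   // smallest code point allowed for this length
    if ((c & 0xE0) == 0xC0)      { n = 1; cp = c & 0x1F; lowest = 0x80; }
    else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; lowest = 0x800; }
    else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; lowest = 0x10000; }
    else return false;                        // not a valid lead byte

    if (len - i <= n) return false;           // sequence is cut off

    for (std::size_t k = 1; k <= n; ++k) {
      const unsigned char cc = s[i + k];
      if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;
    i += n + 1;
  }
  return true;
}
```

A caller would map a false result to the error the tests expect (FOUT1190 in the cases listed below).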

UNPARSED-TEXT-LINES
<fots:test-case name="fn-unparsed-text-lines-037" result="fail"/>
Expected Error FOUT1200: unknown encoding
<fots:test-case name="fn-unparsed-text-lines-038" result="fail"/>
Expected Error FOUT1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-039" result="fail"/>
Expected Error FOUT1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-042" result="fail"/>
uses unparsed-text not unparsed-text-lines (utf-8 bom issue)

UNPARSED-TEXT
<fots:test-case name="fn-unparsed-text-037" result="fail"/>
Expected Error FOUT1200: unknown encoding
<fots:test-case name="fn-unparsed-text-038" result="fail"/>
Expected Error FOUT1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-039" result="fail"/>
Expected Error FOUT1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-042" result="fail"/>
utf-8 bom problem
<fots:test-case name="fn-unparsed-text-048" result="wrongError"/>
Expected Error FOUT1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-049" result="fail"/>
counts one more (can be utf-8 bom)

UNPARSED-TEXT-AVAILABLE
<fots:test-case name="fn-unparsed-text-available-036" result="fail"/>
unknown encoding
<fots:test-case name="fn-unparsed-text-available-037" result="fail"/>
utf-8 validation
<fots:test-case name="fn-unparsed-text-available-038" result="fail"/>
utf-8 validation

Chris Hillery (ceejatec) wrote :

1. It looks like you didn't check in some changes? sequences_impl.cpp refers to a method URI::get_encoded_fragment() that doesn't exist.

2. sequences_impl.cpp now shows up on my system as "ISO-8859 English text" rather than "ASCII English text" when I use the "file" command. My editor (QtCreator) refuses to edit this file because of the encoding. This appears to be because you used some non-ASCII characters, specifically when comparing the variable "peek" to various characters. Please use escape sequences (e.g. \xXX hex escapes) for these characters so the source is still ASCII, to prevent those characters from being munged accidentally in the future.

3. You have some merge conflicts as well.
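For point 2, a minimal sketch of keeping the source ASCII-only. Standard C++ spells such bytes as \xHH hex escapes or as integer constants; the names below are illustrative and not taken from the branch. One caveat worth noting: istream::peek() returns an int, so comparing it with a plain char literal such as '\xEF' can fail on platforms where char is signed. The helper below compares against an unsigned value instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Byte-order marks written as numeric constants, so the file stays ASCII.
const unsigned char kUtf8Bom[]    = { 0xEF, 0xBB, 0xBF };
const unsigned char kUtf16BeBom[] = { 0xFE, 0xFF };
const unsigned char kUtf16LeBom[] = { 0xFF, 0xFE };

// True when the first len bytes of buf begin with the given mark.
inline bool starts_with(const char* buf, std::size_t len,
                        const unsigned char* mark, std::size_t mark_len) {
  assert(buf != nullptr && mark != nullptr);
  return len >= mark_len && std::memcmp(buf, mark, mark_len) == 0;
}

// True when the next byte in the stream equals expected. peek() yields the
// byte as a non-negative int (or EOF), so no signed-char surprises here.
inline bool next_byte_is(std::istream& in, unsigned char expected) {
  return in.peek() == static_cast<int>(expected);
}
```

A string literal with escapes, such as "\xEF\xBB\xBF", works equally well where a char array is more convenient.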

I'll review the actual changes more closely when at least (1) is fixed up.

review: Needs Fixing
Chris Hillery (ceejatec) wrote :

Ok, maybe ignore point #1 - that function is there after all. I'm not sure how I missed it; must have been a typo in my search. Anyway, I'll review the code more thoroughly a bit later tonight. Points #2 and #3 still need to be fixed.

lp:~zorba-coders/zorba/bug1123835 updated
11284. By sorin.marian.nasoi <email address hidden>

- replaced the "ISO-8859" characters with ASCII equivalents
- added big-endian and little-endian for UTF-16

11285. By sorin.marian.nasoi <email address hidden>

- merge lp:zorba trunk after conflicts were solved

11286. By sorin.marian.nasoi <email address hidden>

- removed big-endian part
- added changes to the EXPECTED_FAILURES

Sorin Marian Nasoi (sorin.marian.nasoi) wrote :

> Ok, maybe ignore point #1 - that function is there after all. I'm not sure how
> I missed it; must have been a typo in my search. Anyway, I'll review the code
> more thoroughly a bit later tonight. Points #2 and #3 still need to be fixed.
Points #2 and #3 should be fixed.

The proposed changes, so far:
- fix 17 'wrongError' test cases (meaning they now raise the correct error and pass)
- fix 5 'fail' test cases (please see changes in 'test/fots/CMakeLists.txt')

There are still 15 failures in the fn-unparsed-text* test cases:

- 7 in "fn-unparsed-text-lines":
"fn-unparsed-text-lines-037,fn-unparsed-text-lines-038,fn-unparsed-text-lines-039,fn-unparsed-text-lines-042,fn-unparsed-text-lines-050,fn-unparsed-text-lines-053,fn-unparsed-text-lines-054"

- 3 in "fn-unparsed-text-available":
"fn-unparsed-text-available-036,fn-unparsed-text-available-037,fn-unparsed-text-available-038"

- 5 in "fn-unparsed-text":
"fn-unparsed-text-037,fn-unparsed-text-038,fn-unparsed-text-039,fn-unparsed-text-042,fn-unparsed-text-050"

Chris Hillery (ceejatec) wrote :

Unfortunately, I don't think the proposed change is safe. istream::unget() is not guaranteed to work, and in particular it probably won't work if the stream is coming via HTTP.

However, there is a function StreamResource::isStreamSeekable(). Perhaps you could check that flag, and only call the BOM-checking/unget block of code when it returns true. We could at least claim that we are making a best-effort to validate the incoming data.

I would be much happier with something like a three-byte "wrapper" stream buffer that worked the same way BufferedInputStream does in Java, but it seems like that is somehow really hard to accomplish in C++.
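The three-byte wrapper idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name lookahead_streambuf and its members are hypothetical, it is not wired into Zorba's attach()/detach() machinery, and it forwards one byte at a time after the captured prefix, which a production version would likely replace with a bulk xsgetn() override.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper: wraps another streambuf and keeps the first few
// bytes in a private get area, so they can be inspected and then re-read
// without relying on unget() or on seeking in the underlying source.
class lookahead_streambuf : public std::streambuf {
public:
  explicit lookahead_streambuf(std::streambuf* orig) : orig_(orig), len_(0) {
    assert(orig != nullptr);
    // Pull at most 3 bytes; sgetn() returns fewer on short input.
    len_ = static_cast<std::size_t>(
        orig_->sgetn(head_, static_cast<std::streamsize>(sizeof head_)));
    setg(head_, head_, head_ + len_);
  }

  // Bytes captured at construction (0 to 3 of them).
  const char* head() const { return head_; }
  std::size_t head_size() const { return len_; }

  // Drop the first n captured bytes (e.g. a BOM) so readers never see them.
  // Intended to be called before any reading starts.
  void skip(std::size_t n) {
    if (n > len_) n = len_;
    setg(head_, head_ + n, head_ + len_);
  }

protected:
  // Called once the captured bytes are used up: fetch one byte from the
  // wrapped buffer and expose it through a one-character get area.
  int_type underflow() override {
    const int_type c = orig_->sbumpc();
    if (traits_type::eq_int_type(c, traits_type::eof()))
      return c;
    one_ = traits_type::to_char_type(c);
    setg(&one_, &one_, &one_ + 1);
    return c;
  }

private:
  std::streambuf* orig_;
  char head_[3];
  char one_;
  std::size_t len_;
};
```

A caller would construct the wrapper over the resource's streambuf, inspect head() for a BOM, call skip() if one is found, and then read through an std::istream built on the wrapper. Nothing is put back into the original stream, so it should behave the same for HTTP and file sources.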

review: Needs Fixing
Chris Hillery (ceejatec) wrote :

Also, we should not have a check in our code for a base URI "#UNDEFINED". That comes from FOTS; it is not part of Zorba or XQuery. I do not know how to tell whether the base URI of a static context is actually "undefined" or not. In fact I'm not even 100% sure what that means.

review: Needs Fixing
lp:~zorba-coders/zorba/bug1123835 updated
11287. By Juan Zacarias

Implementation of a stream wrapper for unparsed-text Functions.

11288. By Juan Zacarias

Fixed implementation of streambuf wrapper for unparsed-text* functions.

11289. By Juan Zacarias

Fixed divergence of branch.

11290. By Juan Zacarias

Fixes for linux build.

11291. By Juan Zacarias

Merged with trunk, added impl for utf-8 valid characters.

11292. By Juan Zacarias

Fixed build for Linux.

11293. By Juan Zacarias

Added custom getline for unparsed-text-lines function implementation.

11294. By Juan Zacarias

Fixed some unparsed-text wrong errors.

11295. By Juan Zacarias

Updated expected failure list for unparsed-text* fots tests.

11296. By Juan Zacarias

Fixed throw error for unparsed-text* functions.

11297. By Juan Zacarias

Fixed warnings.

11298. By Juan Zacarias

Changes to fn-unparsed-text-available.

Sorin Marian Nasoi (sorin.marian.nasoi) wrote :

3 of the failing test-cases are correct (fn-unparsed-text-lines-039, fn-unparsed-text-039, fn-unparsed-text-available-038).

Here is why:
The 3 test-cases use fn/unparsed-text/non-xml-character.txt, which contains a BOM followed by a NULL character.

Here is the catch: NULL is not a valid character in either XML 1.0 or XML 1.1.

The F&O spec mentions as an error condition for all 3 unparsed-text* functions:

"A dynamic error is raised [err:FOUT1190]
[...]
if the resulting characters are not permitted XML characters."
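The "permitted XML characters" condition can be sketched as a pair of range checks following the Char productions of XML 1.0 and XML 1.1. The function names are hypothetical and the input is assumed to be already decoded into code points; this is an illustration, not the validation code in the branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
//                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
inline bool is_xml10_char(std::uint32_t cp) {
  return cp == 0x9 || cp == 0xA || cp == 0xD ||
         (cp >= 0x20 && cp <= 0xD7FF) ||
         (cp >= 0xE000 && cp <= 0xFFFD) ||
         (cp >= 0x10000 && cp <= 0x10FFFF);
}

// XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
inline bool is_xml11_char(std::uint32_t cp) {
  return (cp >= 0x1 && cp <= 0xD7FF) ||
         (cp >= 0xE000 && cp <= 0xFFFD) ||
         (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Index of the first code point not permitted in XML 1.0, or n if all pass.
inline std::size_t first_non_xml10(const std::uint32_t* cps, std::size_t n) {
  assert(cps != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i)
    if (!is_xml10_char(cps[i]))
      return i;
  return n;
}
```

Both predicates reject U+0000, which is why a file consisting of a BOM followed by NULL should raise err:FOUT1190 under either XML version.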

lp:~zorba-coders/zorba/bug1123835 updated
11299. By Juan Zacarias

Modified validation of utf8 in the unparsed-text* functions to detect invalid xml.

11300. By Juan Zacarias

Merged with trunk.

11301. By Juan Zacarias

Fixed Wrong Error message error.

Chris Hillery (ceejatec) wrote :

Sorry, but this implementation isn't right. You're still sucking the entire contents of the istream into memory via that stringstream. That's not acceptable. (You also have a potential memory leak since you don't use an auto_ptr<> for the stringstream you allocate on the heap, but that will be irrelevant since there shouldn't be a stringstream at all.)

I believe this implementation will also throw an error if there aren't at least 3 bytes in the input stream.

Please re-read my comment from 2013-04-24. The best solution would be to implement a buffer::attach() iostreams class with a 3-byte buffer. This would allow you to safely read and, if necessary, put back 3 bytes to check for a BOM. You could then attach THAT to either unparsed::attach() or transcode::attach() as you do here. I kind of feel like there must be an open-source class that does that already; it's completely generic.

Failing that, the closest thing to a right answer would be to revert to the code you had before using istream::unget(), with an additional check that the stream is seekable before attempting it. It's half a solution, but it's better than no solution.
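The fallback could look roughly like the sketch below. It is written under assumptions: the seekable flag stands in for whatever StreamResource::isStreamSeekable() reports, and the function and enum names are hypothetical. It uses tellg()/seekg() to return to a known position instead of unget(), tolerates inputs shorter than 3 bytes by checking gcount(), and requires every byte of a mark to match (the posted diff joins the UTF-8 byte comparisons with ||, which would accept almost any input sharing one byte with the mark).

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

enum bom_kind { bom_none, bom_utf8, bom_utf16_be, bom_utf16_le };

// Classifies the first len bytes. All bytes of a mark must match.
inline bom_kind classify_bom(const unsigned char* b, std::size_t len) {
  if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return bom_utf8;
  if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF) return bom_utf16_be;
  if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE) return bom_utf16_le;
  return bom_none;
}

inline std::size_t bom_length(bom_kind k) {
  return k == bom_utf8 ? 3 : (k == bom_none ? 0 : 2);
}

// Best-effort BOM check. On a seekable stream the read position ends up
// just past the BOM, or back at the start if there is none. On a stream
// that is not seekable nothing is read and bom_none is returned.
inline bom_kind consume_bom(std::istream& in, bool seekable) {
  if (!seekable)
    return bom_none;
  const std::istream::pos_type start = in.tellg();
  if (start == std::istream::pos_type(-1))
    return bom_none;

  char raw[3];
  in.read(raw, sizeof raw);
  const std::size_t got = static_cast<std::size_t>(in.gcount());
  assert(got <= sizeof raw);
  in.clear();  // a short read sets eofbit and failbit; reset before seeking

  const bom_kind kind =
      classify_bom(reinterpret_cast<const unsigned char*>(raw), got);
  in.seekg(start + static_cast<std::istream::off_type>(bom_length(kind)));
  return kind;
}
```

The returned kind can then drive the choice of encoding before transcode::attach() or unparsed::attach() is applied, without copying the rest of the stream into memory.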

review: Needs Fixing
lp:~zorba-coders/zorba/bug1123835 updated
11302. By sorin.marian.nasoi <email address hidden>

- merged lp:zorba trunk after fixing the conflicts in test/fots/CMakeLists.txt

11303. By sorin.marian.nasoi <email address hidden>

- merge lp:zorba trunk.

Chris Hillery (ceejatec) wrote :

Rejecting this proposal as it stands. If we want to fix this bug in future, we need to revisit the plan.

Unmerged revisions

11303. By sorin.marian.nasoi <email address hidden>

- merge lp:zorba trunk.

11302. By sorin.marian.nasoi <email address hidden>

- merged lp:zorba trunk after fixing the conflicts in test/fots/CMakeLists.txt

11301. By Juan Zacarias

Fixed Wrong Error message error.

11300. By Juan Zacarias

Merged with trunk.

11299. By Juan Zacarias

Modified validation of utf8 in the unparsed-text* functions to detect invalid xml.

11298. By Juan Zacarias

Changes to fn-unparsed-text-available.

11297. By Juan Zacarias

Fixed warnings.

11296. By Juan Zacarias

Fixed throw error for unparsed-text* functions.

11295. By Juan Zacarias

Updated expected failure list for unparsed-text* fots tests.

11294. By Juan Zacarias

Fixed some unparsed-text wrong errors.

Preview Diff

1=== modified file 'src/runtime/CMakeLists.txt'
2--- src/runtime/CMakeLists.txt 2013-06-15 02:57:08 +0000
3+++ src/runtime/CMakeLists.txt 2013-08-13 06:08:34 +0000
4@@ -135,6 +135,7 @@
5 numerics/format_integer.cpp
6 numerics/format_number.cpp
7 sequences/SequencesImpl.cpp
8+ sequences/unparsed_streambuf.cpp
9 visitors/iterprinter.cpp
10 update/update.cpp
11 util/item_iterator.cpp
12
13=== modified file 'src/runtime/sequences/sequences_impl.cpp'
14--- src/runtime/sequences/sequences_impl.cpp 2013-07-12 14:15:52 +0000
15+++ src/runtime/sequences/sequences_impl.cpp 2013-08-13 06:08:34 +0000
16@@ -61,6 +61,9 @@
17 #include "zorbautils/hashset_node_itemh.h"
18 #include "zorbautils/hashset_atomic_itemh.h"
19
20+#include <runtime/sequences/unparsed_streambuf.h>
21+#include <zorba/internal/proxy.h>
22+
23 namespace zorbatm = zorba::time;
24
25 using namespace std;
26@@ -2135,8 +2138,14 @@
27 {
28 //Normalize input to handle filesystem paths, etc.
29 zstring lNormUri;
30- normalizeInputUri(aUri, aSctx, loc, &lNormUri);
31-
32+ try
33+ {
34+ normalizeInputUri(aUri, aSctx, loc, &lNormUri);
35+ }
36+ catch (...)
37+ {
38+ throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(aUri), ERROR_LOC(loc));
39+ }
40 //Check for a fragment identifier
41 //Create a zorba::URI for validating if it contains a fragment
42 std::auto_ptr<zorba::URI> lUri(new zorba::URI(lNormUri));
43@@ -2144,12 +2153,17 @@
44 {
45 throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(aUri), ERROR_LOC(loc));
46 }
47+
48+ zstring lEncoding = aEncoding;
49+ if (!transcode::is_supported(lEncoding.c_str()))
50+ {
51+ throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(aUri), ERROR_LOC(loc));
52+ }
53
54 //Resolve URI to stream
55 zstring lErrorMessage;
56 std::auto_ptr<internal::Resource> lResource = aSctx->resolve_uri
57 (lNormUri, internal::EntityData::SOME_CONTENT, lErrorMessage);
58-
59 internal::StreamResource* lStreamResource =
60 dynamic_cast<internal::StreamResource*>(lResource.get());
61
62@@ -2159,18 +2173,51 @@
63 }
64 StreamReleaser lStreamReleaser = lStreamResource->getStreamReleaser();
65 std::unique_ptr<std::istream, StreamReleaser> lStream(lStreamResource->getStream(), lStreamReleaser);
66-
67 lStreamResource->setStreamReleaser(nullptr);
68
69+ char lBOM[3];
70+ char lUTF8BOM[] = { 239, 187, 191 };
71+ char lUTF16BOMBE[] = { 254, 255 };
72+ char lUTF16BOMLE[] = { 255, 254 };
73+ zstring lEncoding = aEncoding;
74+ int lBufMark = 0;
75+ lStream->read(lBOM, 3);
76+ if ( lUTF8BOM[0] == lBOM[0] ||
77+ lUTF8BOM[1] == lBOM[1] ||
78+ lUTF8BOM[2] == lBOM[2])
79+ {
80+ lEncoding = "UTF-8";
81+ lBufMark = 3;
82+ //unparsed::attach(*lStream.get(), 0);
83+ }
84+ else if ( (lUTF16BOMBE[0] == lBOM[0] && lUTF16BOMBE[1] == lBOM[1]) ||
85+ (lUTF16BOMLE[0] == lBOM[0] && lUTF16BOMLE[1] == lBOM[1]))
86+ {
87+ lEncoding = "UTF-16";
88+ std::stringstream* stream = new stringstream();
89+ stream->write(lBOM, 3);
90+ *stream << lStream->rdbuf();
91+ lStream->rdbuf(stream->rdbuf());
92+ }
93+ else
94+ {
95+ std::stringstream* stream = new stringstream();
96+ stream->write(lBOM, 3);
97+ *stream << lStream->rdbuf();
98+ lStream->rdbuf(stream->rdbuf());
99+ }
100+
101 //check if encoding is needed
102- if (transcode::is_necessary(aEncoding.c_str()))
103- {
104- if (!transcode::is_supported(aEncoding.c_str()))
105- {
106- throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(aUri), ERROR_LOC(loc));
107- }
108- transcode::attach(*lStream.get(), aEncoding.c_str());
109- }
110+
111+ if (transcode::is_necessary(lEncoding.c_str()))
112+ {
113+ transcode::attach(*lStream.get(), lEncoding.c_str());
114+ }
115+ else
116+ {
117+ unparsed::attach(*lStream.get(), lBufMark, aUri, loc);
118+ }
119+
120 //creates stream item
121 GENV_ITEMFACTORY->createStreamableString(
122 oResult,
123@@ -2224,6 +2271,7 @@
124 store::Item_t encodingItem;
125 zstring uriString;
126 zstring encodingString("UTF-8");
127+ zstring lSctxUri;
128
129 PlanIteratorState* state;
130 DEFAULT_STACK_INIT(PlanIteratorState, state, planState);
131@@ -2241,6 +2289,10 @@
132
133 uriItem->getStringValue2(uriString);
134
135+ lSctxUri = theSctx->get_base_uri();
136+ if (lSctxUri == "" || lSctxUri == "file:///#UNDEFINED")
137+ throw XQUERY_EXCEPTION(err::XPST0001, ERROR_PARAMS(uriString), ERROR_LOC(loc));
138+
139 try
140 {
141 readDocument(uriString, encodingString, theSctx, planState, loc, unparsedText);
142@@ -2258,6 +2310,87 @@
143 /*******************************************************************************
144 14.8.6 fn:unparsed-text-lines
145 ********************************************************************************/
146+template<typename CharType,class TraitsType,class Rep>
147+std::basic_istream<CharType,TraitsType>&
148+getline_no_endlines( std::basic_istream<CharType,TraitsType> &is, rstring<Rep> &s) {
149+ typedef std::basic_istream<CharType,TraitsType> istream_type;
150+ typedef typename istream_type::int_type int_type;
151+ typedef std::basic_streambuf<CharType,TraitsType> streambuf_type;
152+ typedef rstring<Rep> string_type;
153+ typedef typename string_type::size_type size_type;
154+
155+ std::ios_base::iostate err = std::ios_base::iostate( std::ios_base::goodbit );
156+ size_type extracted = 0;
157+ int_type const idelim1 = TraitsType::to_int_type( '\r' );
158+ int_type const idelim2 = TraitsType::to_int_type( '\n' );
159+ int_type const eof = TraitsType::eof();
160+ std::string check ="";
161+ s.clear();
162+ try {
163+ streambuf_type *const sb = is.rdbuf();
164+ int_type c = sb->sgetc();
165+
166+ while ( !TraitsType::eq_int_type( c, eof ) &&
167+ ( !TraitsType::eq_int_type( c, idelim1 ) &&
168+ !TraitsType::eq_int_type( c, idelim2 ) ) ) {
169+ s += TraitsType::to_char_type( c );
170+ check += TraitsType::to_char_type( c );
171+ ++extracted;
172+ c = sb->snextc();
173+ }
174+ if ( TraitsType::eq_int_type( c, eof ) )
175+ err |= std::ios_base::eofbit;
176+ else if ( TraitsType::eq_int_type (c, idelim1) ) {
177+ ++extracted;
178+ sb->sbumpc();
179+ c = sb->sgetc();
180+ if (!c)
181+ {
182+ ++extracted;
183+ sb->sbumpc();
184+ c = sb->sgetc();
185+ }
186+ if ( TraitsType::eq_int_type( c, eof ))
187+ {
188+ err |= std::ios_base::eofbit;
189+ }
190+ if ( TraitsType::eq_int_type( c, idelim2 ) ) {
191+ ++extracted;
192+ sb->sbumpc();
193+ c = sb->sgetc();
194+ if (!c)
195+ {
196+ ++extracted;
197+ sb->sbumpc();
198+ c = sb->sgetc();
199+ }
200+ if ( TraitsType::eq_int_type( c, eof ))
201+ {
202+ err |= std::ios_base::eofbit;
203+ }
204+ }
205+ }
206+ else if ( TraitsType::eq_int_type( c, idelim2 ) ) {
207+ ++extracted;
208+ sb->sbumpc();
209+ c = sb->sgetc();
210+ if ( TraitsType::eq_int_type( c, eof ))
211+ {
212+ err |= std::ios_base::eofbit;
213+ }
214+ } else
215+ err |= std::ios_base::failbit;
216+ }
217+ catch ( ... ) {
218+ is.setstate( std::ios_base::badbit );
219+ }
220+ if ( !extracted )
221+ err |= std::ios_base::failbit;
222+ if ( err )
223+ is.setstate( err );
224+ return is;
225+}
226+
227 FnUnparsedTextLinesIteratorState::~FnUnparsedTextLinesIteratorState()
228 {
229 delete theStream;
230@@ -2278,7 +2411,12 @@
231 std::auto_ptr<internal::Resource> lResource;
232 StreamReleaser lStreamReleaser;
233 std::auto_ptr<zorba::URI> lUri;
234-
235+ char lBOM[3];
236+ char lUTF8BOM[] = { 239, 187, 191 };
237+ char lUTF16BOMBE[] = { 254, 255 };
238+ char lUTF16BOMLE[] = { 255, 254 };
239+ int lBufMark(0);
240+
241 FnUnparsedTextLinesIteratorState* state;
242 DEFAULT_STACK_INIT(FnUnparsedTextLinesIteratorState, state, planState);
243
244@@ -2295,7 +2433,14 @@
245
246 //Normalize input to handle filesystem paths, etc.
247 uriItem->getStringValue2(uriString);
248- normalizeInputUri(uriString, theSctx, loc, &lNormUri);
249+ try
250+ {
251+ normalizeInputUri(uriString, theSctx, loc, &lNormUri);
252+ }
253+ catch (...)
254+ {
255+ throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(uriString), ERROR_LOC(loc));
256+ }
257
258 //Check for a fragment identifier
259 //Create a zorba::URI for validating if it contains a fragment
260@@ -2308,7 +2453,7 @@
261 //Resolve URI to stream
262 lResource = theSctx->resolve_uri
263 (lNormUri, internal::EntityData::SOME_CONTENT, lErrorMessage);
264-
265+
266 state->theStreamResource =
267 dynamic_cast<internal::StreamResource*>(lResource.get());
268
269@@ -2319,7 +2464,33 @@
270 state->theStream = new std::unique_ptr<std::istream, StreamReleaser> (state->theStreamResource->getStream(), lStreamReleaser);
271 state->theStreamResource->setStreamReleaser(nullptr);
272
273- //check if encoding is needed
274+ //Check for bom utf-8 and remove the bom definition
275+ state->theStream->get()->read(lBOM, 3);
276+ if ( lUTF8BOM[0] == lBOM[0] ||
277+ lUTF8BOM[1] == lBOM[1] ||
278+ lUTF8BOM[2] == lBOM[2])
279+ {
280+ encodingString = "UTF-8";
281+ lBufMark = 3;
282+ //unparsed::attach(*state->theStream->get(), 3);
283+ }
284+ else if ( (lUTF16BOMBE[0] == lBOM[0] && lUTF16BOMBE[1] == lBOM[1]) ||
285+ (lUTF16BOMLE[0] == lBOM[0] && lUTF16BOMLE[1] == lBOM[1]))
286+ {
287+ encodingString = "UTF-16";
288+ std::stringstream* stream = new stringstream();
289+ stream->write(lBOM, 3);
290+ *stream << state->theStream->get()->rdbuf();
291+ state->theStream->get()->rdbuf(stream->rdbuf());
292+ }
293+ else
294+ {
295+ std::stringstream* stream = new stringstream();
296+ stream->write(lBOM, 3);
297+ *stream << state->theStream->get()->rdbuf();
298+ state->theStream->get()->rdbuf(stream->rdbuf());
299+ }
300+
301 if (transcode::is_necessary(encodingString.c_str()))
302 {
303 if (!transcode::is_supported(encodingString.c_str()))
304@@ -2328,10 +2499,14 @@
305 }
306 transcode::attach(*state->theStream->get(), encodingString.c_str());
307 }
308+ else
309+ {
310+ unparsed::attach(*state->theStream->get(), lBufMark, uriString, loc);
311+ }
312
313 while (state->theStream->get()->good())
314 {
315- getline(*state->theStream->get(), streamLine);
316+ getline_no_endlines(*state->theStream->get(), streamLine);
317 STACK_PUSH(GENV_ITEMFACTORY->createString(result, streamLine), state);
318 }
319
320
321=== added file 'src/runtime/sequences/unparsed_streambuf.cpp'
322--- src/runtime/sequences/unparsed_streambuf.cpp 1970-01-01 00:00:00 +0000
323+++ src/runtime/sequences/unparsed_streambuf.cpp 2013-08-13 06:08:34 +0000
324@@ -0,0 +1,226 @@
325+#include "unparsed_streambuf.h"
326+
327+#include "diagnostics/xquery_diagnostics.h"
328+#include "diagnostics/util_macros.h"
329+#include "util/oseparator.h"
330+
331+#include <iomanip>
332+
333+using namespace std;
334+
335+namespace zorba {
336+namespace unparsed{
337+ namespace xml{
338+ streambuf::pos_type streambuf::seekoff( off_type o, ios_base::seekdir d, ios_base::openmode m )
339+ {
340+ clear();
341+ return original()->pubseekoff( o, d, m );
342+ }
343+
344+ streambuf::pos_type streambuf::seekpos( pos_type p, ios_base::openmode m )
345+ {
346+ clear();
347+ return original()->pubseekpos( p, m );
348+ }
349+
350+ streambuf::int_type streambuf::pbackfail( int_type c )
351+ {
352+ if ( !traits_type::eq_int_type( c, traits_type::eof() ) &&
353+ gbuf_.cur_len_ &&
354+ original()->sputbackc( traits_type::to_char_type( c ) ) ) {
355+ --gbuf_.cur_len_;
356+ return c;
357+ }
358+ return traits_type::eof();
359+ }
360+
361+ streambuf::int_type streambuf::uflow()
362+ {
363+ #ifdef ZORBA_DEBUG_UTF8_STREAMBUF
364+ printf( "uflow()\n" );
365+ #endif
366+ int_type const c = original()->sbumpc();
367+ if ( traits_type::eq_int_type( c, traits_type::eof() ) )
368+ return traits_type::eof();
369+ gbuf_.validate( traits_type::to_char_type( c ) );
370+ return c;
371+ }
372+
373+ inline void streambuf::clear() {
374+ gbuf_.clear();
375+ }
376+
377+ streamsize streambuf::xsgetn(char_type* to, std::streamsize size )
378+ {
379+ #ifdef ZORBA_DEBUG_UTF8_STREAMBUF
380+ printf( "xsgetn()\n" );
381+ #endif
382+ streamsize return_size = 0;
383+
384+ if ( gbuf_.char_len_ ) {
385+ streamsize const want = gbuf_.char_len_ - gbuf_.cur_len_;
386+ streamsize const get = min( want, size );
387+ streamsize const got = original()->sgetn( to, get );
388+ for ( streamsize i = 0; i < got; ++i )
389+ gbuf_.validate( to[i] );
390+ to += got;
391+ size -= got, return_size += got;
392+ }
393+
394+ while ( size > 0 ) {
395+ if ( streamsize const got = original()->sgetn( to, size ) ) {
396+ for ( streamsize i = 0; i < got; ++i )
397+ gbuf_.validate( to[i] );
398+ to += got;
399+ size -= got, return_size += got;
400+ } else
401+ break;
402+ }
403+ return return_size;
404+ }
405+
406+ inline void streambuf::buf_type::clear()
407+ {
408+ char_len_ = 0;
409+ }
410+
411+ void streambuf::buf_type::throw_invalid_utf8( utf8::storage_type *buf, utf8::size_type len ) {
412+ ostringstream oss;
413+ oss << hex << setfill('0') << setw(2) << uppercase;
414+ oseparator comma( ',' );
415+
416+ for ( utf8::size_type i = 0; i < len; ++i )
417+ oss << comma << "0x" << (static_cast<unsigned>( buf[i] ) & 0xFF);
418+
419+ clear();
420+ throw ZORBA_EXCEPTION(
421+ zerr::ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE,
422+ ERROR_PARAMS( oss.str() )
423+ );
424+ }
425+
426+ void streambuf::buf_type::validate( utf8::storage_type c, bool bump ) {
427+ utf8::size_type char_len_copy = char_len_, cur_len_copy = cur_len_;
428+
429+ if ( !char_len_copy ) {
430+ //
431+ // This means we're (hopefully) at the first byte of a UTF-8 byte sequence
432+ // comprising a character.
433+ //
434+ try {
435+ char_len_copy = utf8::char_length( c );
436+ cur_len_copy = 0;
437+ if (!c)
438+ throw_invalid_utf8 ( &c, 1);
439+ }
440+ catch ( utf8::invalid_byte const& ) {
441+ throw_invalid_utf8( &c, 1 );
442+ }
443+ }
444+
445+ utf8::storage_type *const cur_byte_ptr = utf8_char_ + cur_len_copy;
446+ utf8::storage_type const old_byte = *cur_byte_ptr;
447+ *cur_byte_ptr = c;
448+
449+ if ( cur_len_copy++ && !utf8::is_continuation_byte( c ) )
450+ throw_invalid_utf8( utf8_char_, cur_len_copy );
451+
452+ if ( bump ) {
453+ char_len_ = (cur_len_copy == char_len_copy ? 0 : char_len_copy);
454+ cur_len_ = cur_len_copy;
455+ } else {
456+ *cur_byte_ptr = old_byte;
457+ }
458+ }
459+ } //xml namespace
460+
461+ streambuf::streambuf(std::streambuf* orig, zstring const& uri, QueryLoc const& loc) :
462+ proxy_buf(new xml::streambuf(orig)),
463+ i_uri(uri),
464+ i_loc(loc),
465+ i_mark(0)
466+ {
467+ }
468+
469+ streambuf::streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc) :
470+ proxy_buf(new xml::streambuf(orig)),
471+ i_uri(uri),
472+ i_loc(loc),
473+ i_mark(mark)
474+ {
475+ }
476+
477+ streambuf::~streambuf() {}
478+
479+ void streambuf::imbue( std::locale const &loc)
480+ {
481+ proxy_buf->pubimbue( loc );
482+ }
483+
484+ streambuf::pos_type streambuf::seekoff( off_type o, ios_base::seekdir d, ios_base::openmode m )
485+ {
486+ return proxy_buf->pubseekoff( o + i_mark, d, m);
487+ }
488+
489+ streambuf::pos_type streambuf::seekpos( pos_type p, ios_base::openmode m ) {
490+ return proxy_buf->pubseekpos( p, m );
491+ }
492+
493+ std::streambuf* streambuf::setbuf( char_type *p, streamsize s ) {
494+ proxy_buf->pubsetbuf( p, s );
495+ return this;
496+ }
497+
498+ streamsize streambuf::showmanyc() {
499+ return proxy_buf->in_avail();
500+ }
501+
502+ int streambuf::sync() {
503+ return proxy_buf->pubsync();
504+ }
505+
506+ streambuf::int_type streambuf::overflow( int_type c ) {
507+ return proxy_buf->sputc( c );
508+ }
509+
510+ streambuf::int_type streambuf::pbackfail( int_type c ) {
511+ return traits_type::eq_int_type( c, traits_type::eof() ) ?
512+ c : proxy_buf->sputbackc( traits_type::to_char_type( c ) );
513+ }
514+
515+ streambuf::int_type streambuf::uflow() {
516+ return proxy_buf->sbumpc();
517+ }
518+
519+ streambuf::int_type streambuf::underflow() {
520+ return proxy_buf->sgetc();
521+ }
522+
523+ streamsize streambuf::xsgetn( char_type *to, streamsize size ) {
524+ streamsize res;
525+ try
526+ {
527+ res = proxy_buf->sgetn( to, size );
528+ }
529+ catch (ZorbaException const& e)
530+ {
531+ if (e.diagnostic() == zerr::ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE)
532+ throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(i_uri.c_str()), ERROR_LOC(i_loc));
533+ else throw;
534+ }
535+ return res;
536+ }
537+
538+ streamsize streambuf::xsputn( char_type const *from, streamsize size ) {
539+ return proxy_buf->sputn( from, size );
540+ }
541+
542+ /*********************************************************************/
543+ /*********************************************************************/
544+ std::streambuf* alloc_streambuf(std::streambuf *orig, int mark, zstring const& uri, QueryLoc const& loc)
545+ {
546+ return new zorba::unparsed::streambuf(orig, mark, uri, loc);
547+ }
548+
549+}//namesapce unparsed
550+}//namesapce zorba
551\ No newline at end of file
552
553=== added file 'src/runtime/sequences/unparsed_streambuf.h'
554--- src/runtime/sequences/unparsed_streambuf.h 1970-01-01 00:00:00 +0000
555+++ src/runtime/sequences/unparsed_streambuf.h 2013-08-13 06:08:34 +0000
556@@ -0,0 +1,131 @@
557+#ifndef ZORBA_UNPARSED_STREAM_H
558+#define ZORBA_UNPARSED_STREAM_H
559+
560+#include <zorba/internal/streambuf.h>
561+#include <zorba/internal/unique_ptr.h>
562+#include "common/shared_types.h"
563+#include "diagnostics/xquery_diagnostics.h"
564+#include "diagnostics/util_macros.h"
565+#include "util/utf8_streambuf.h"
566+
567+
568+namespace zorba{
569+namespace unparsed{
570+ namespace xml{
571+ //Streambuf class for validating valid characters on xml 1.0 and xml 1.1
572+ class streambuf : public utf8::streambuf
573+ {
574+ public:
575+ streambuf(std::streambuf* orig) :
576+ utf8::streambuf( orig, false ){ clear(); }
577+
578+ private:
579+ struct buf_type
580+ {
581+
582+ utf8::encoded_char_type utf8_char_;
583+ utf8::size_type char_len_;
584+ utf8::size_type cur_len_;
585+
586+ void clear();
587+ void throw_invalid_utf8( utf8::storage_type *buf, utf8::size_type len );
588+ void validate( utf8::storage_type, bool bump = true );
589+ };
590+
591+ buf_type gbuf_;
592+
593+ protected:
594+ void clear();
595+ std::streamsize xsgetn(char_type*, std::streamsize);
596+ pos_type seekoff( off_type, std::ios_base::seekdir, std::ios_base::openmode );
597+ pos_type seekpos( pos_type, std::ios_base::openmode );
598+ int_type pbackfail( int_type );
599+ int_type uflow();
600+ };
601+
602+ }//namespace xml
603+
604+ class streambuf :public std::streambuf {
605+ public:
606+ streambuf(std::streambuf* orig, zstring const& uri, QueryLoc const& loc);
607+ streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc);
608+ ~streambuf();
609+
610+ std::streambuf* orig_streambuf() const {
611+ return proxy_buf->original();
612+ }
613+
614+ protected:
615+ void imbue( std::locale const& );
616+ pos_type seekoff( off_type, std::ios_base::seekdir, std::ios_base::openmode );
617+ pos_type seekpos( pos_type, std::ios_base::openmode );
618+ std::streambuf* setbuf( char_type*, std::streamsize );
619+ std::streamsize showmanyc();
620+ int sync();
621+ int_type overflow( int_type );
622+ int_type pbackfail( int_type );
623+ int_type uflow();
624+ int_type underflow();
625+ std::streamsize xsgetn( char_type*, std::streamsize );
626+ std::streamsize xsputn( char_type const*, std::streamsize );
627+
628+ private:
629+ std::unique_ptr<internal::proxy_streambuf> proxy_buf;
630+ zstring i_uri;
631+ QueryLoc i_loc;
632+ int i_mark;
633+
634+ streambuf( streambuf const&);
635+ streambuf& operator=( streambuf const&);
636+ };
637+
638+
639+ std::streambuf* alloc_streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc);
640+
641+ template<typename charT, class Traits> inline
642+ void attach( std::basic_ios<charT,Traits> &ios, int mark, zstring const& uri, QueryLoc const& loc)
643+ {
644+ int const index = std::ios_base::xalloc();
645+ void *&pword = ios.pword( index );
646+ if ( !pword ) {
647+ std::streambuf *const buf =
648+ alloc_streambuf( ios.rdbuf(), mark, uri, loc );
649+ ios.rdbuf( buf );
650+ pword = buf;
651+ ios.register_callback( internal::stream_callback, index );
652+ }
653+ }
654+
655+ template<typename charT, class Traits> inline
656+ void detach( std::basic_ios<charT,Traits> &ios )
657+ {
658+ int const index = std::ios_base::xalloc();
659+ if ( streambuf* const buf = static_cast<streambuf*>( ios.pword( index ) ) )
660+ {
661+ ios.pword( index ) = 0;
662+ ios.rdbuf( buf->orig_streambuf() );
663+ delete buf;
664+ }
665+ }
666+
667+ template<class StreamType>
668+ class auto_attach {
669+ public:
670+ auto_attach(StreamType &stream, int mark, zstring const& uri, QueryLoc const& loc) : stream_( stream )
671+ {
672+ attach(stream, mark, uri, loc);
673+ }
674+
675+ ~auto_attach()
676+ {
677+ detach( stream_ );
678+ }
679+
680+ private:
681+ StreamType &stream_;
682+ };
683+
684+}//namespace unparsed
685+}//namesapce zorba
686+
687+#endif
