Merge lp:~zorba-coders/zorba/bug1123835 into lp:zorba
Status: Rejected
Rejected by: Chris Hillery
Proposed branch: lp:~zorba-coders/zorba/bug1123835
Merge into: lp:zorba
Diff against target: 687 lines (+550/-17), 4 files modified
- src/runtime/CMakeLists.txt (+1/-0)
- src/runtime/sequences/sequences_impl.cpp (+192/-17)
- src/runtime/sequences/unparsed_streambuf.cpp (+226/-0)
- src/runtime/sequences/unparsed_streambuf.h (+131/-0)
To merge this branch: bzr merge lp:~zorba-coders/zorba/bug1123835
Reviewer | Review Type | Date Requested | Status
---|---|---|---
Chris Hillery | | | Needs Fixing
Review via email: mp+158978@code.launchpad.net
Commit message
Fixes for FOTS errors in fn:unparsed-text* functions
Description of the change
Fixes for FOTS errors in fn:unparsed-text* functions
Juan Zacarias (juan457) wrote:
Chris Hillery (ceejatec) wrote:
1. It looks like you didn't check in some changes? sequences_impl.cpp refers to a method URI::get_
2. sequences_impl.cpp now shows up on my system as "ISO-8859 English text" rather than "ASCII English text" when I use the "file" command. My editor (QtCreator) refuses to edit this file because of the encoding. This appears to be because you used some non-ASCII characters, specifically when comparing the variable "peek" to various characters. Please use \0xXXXX encodings for these characters so the result is still ASCII, to prevent those characters from being munged accidentally in the future.
3. You have some merge conflicts as well.
I'll review the actual changes more closely when at least (1) is fixed up.
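Point 2 above could be addressed along the following lines. This is only a minimal sketch, not code from the branch: the helper names `is_latin1_e_acute` and `starts_with_utf8_bom`, and the choice of byte 0xE9 as the example character, are illustrative assumptions. Writing the bytes as hex values keeps the source file 7-bit ASCII, and comparing through `unsigned char` avoids surprises on platforms where plain `char` is signed.

```cpp
#include <cstddef>

// Hypothetical constant: the UTF-8 byte order mark, written as hex values
// so that no non-ASCII character appears in the source file.
static const unsigned char kUtf8Bom[3] = { 0xEF, 0xBB, 0xBF };

// Hypothetical helper: tests whether a byte read from the stream is 0xE9
// (e-acute in ISO-8859-1) without embedding that character in the source.
inline bool is_latin1_e_acute(char peek) {
  // Plain char may be signed, so convert before comparing with 0xE9.
  return static_cast<unsigned char>(peek) == 0xE9;
}

// Hypothetical helper: true if the first three bytes of buf are a UTF-8 BOM.
inline bool starts_with_utf8_bom(const char* buf, std::size_t len) {
  if (len < 3)
    return false;
  for (std::size_t i = 0; i < 3; ++i)
    if (static_cast<unsigned char>(buf[i]) != kUtf8Bom[i])
      return false;
  return true;
}
```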
Chris Hillery (ceejatec) wrote:
Ok, maybe ignore point #1 - that function is there after all. I'm not sure how I missed it; must have been a typo in my search. Anyway, I'll review the code more thoroughly a bit later tonight. Points #2 and #3 still need to be fixed.
- 11284. By sorin.marian.nasoi <email address hidden>
-
- replaced the "ISO-8859" characters with ASCII equivalents
- added big-endian and little-endian for UTF-16
- 11285. By sorin.marian.nasoi <email address hidden>
-
- merge lp:zorba trunk after conflicts were solved
- 11286. By sorin.marian.nasoi <email address hidden>
-
- removed big-endian part
- added changes to the EXPECTED_FAILURES
Sorin Marian Nasoi (sorin.marian.nasoi) wrote:
> Ok, maybe ignore point #1 - that function is there after all. I'm not sure how
> I missed it; must have been a typo in my search. Anyway, I'll review the code
> more thoroughly a bit later tonight. Points #2 and #3 still need to be fixed.
Points #2 and #3 should be fixed.
The proposed changes, so far:
- fix 17 'wrongError' test cases (meaning they now raise the correct error and pass)
- fix 5 'fail' test cases (please see changes in 'test/fots/
There are still 15 failures in the fn-unparsed-text* test cases:
- 7 in "fn-unparsed-
"fn-unparsed-
- 3 in "fn-unparsed-
"fn-unparsed-
- 5 in "fn-unparsed-text":
"fn-unparsed-
Chris Hillery (ceejatec) wrote:
Unfortunately, I don't think the proposed change is safe. istream::unget() is not guaranteed to work, and in particular it probably won't work if the stream is coming via HTTP.
However, there is a function StreamResource:
I would be much happier with something like a three-byte "wrapper" stream buffer that worked the same way BufferedInputStream does in Java, but it seems like that is somehow really hard to accomplish in C++.
Chris Hillery (ceejatec) wrote:
Also, we should not have a check in our code for a base URI "#UNDEFINED". That comes from FOTS; it is not part of Zorba or XQuery. I do not know how to tell whether the base URI of a static context is actually "undefined" or not. In fact I'm not even 100% sure what that means.
- 11287. By Juan Zacarias
-
Implementation of a stream wrapper for unparsed-text Functions.
- 11288. By Juan Zacarias
-
Fixed implementation of streambuf wrapper for unparsed-text* functions.
- 11289. By Juan Zacarias
-
Fixed divergence of branch.
- 11290. By Juan Zacarias
-
Fixes for linux build.
- 11291. By Juan Zacarias
-
Merged with trunk; added impl for utf-8 valid characters.
- 11292. By Juan Zacarias
-
Fixed build for Linux.
- 11293. By Juan Zacarias
-
Added custom getline for unparsed-text-lines function implementation.
- 11294. By Juan Zacarias
-
Fixed some unparsed-text wrong errors.
- 11295. By Juan Zacarias
-
Updated expected failure list for unparsed-text* fots tests.
- 11296. By Juan Zacarias
-
Fixed throw error for unparsed-text* functions.
- 11297. By Juan Zacarias
-
Fixed warnings.
- 11298. By Juan Zacarias
-
Changes to fn-unparsed-text-available.
Sorin Marian Nasoi (sorin.marian.nasoi) wrote:
3 of the failing test-cases are correct (fn-unparsed-
Here is why:
The 3 test-cases use fn/unparsed-
Here is the catch: NULL is not a valid character in either XML 1.0 or XML 1.1.
The F&O spec mentions as an error condition for all 3 unparsed-text* functions:
"A dynamic error is raised [err:FOUT1190]
[...]
if the resulting characters are not permitted XML characters."
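The rule quoted above can be made concrete with the `Char` production of XML 1.0, which permits #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The sketch below is an illustration only; the function name `is_xml10_char` is an assumption and does not come from the branch. It shows why a decoded NULL (code point 0) has to be rejected with err:FOUT1190.

```cpp
#include <cstdint>

// Hypothetical helper: true if the code point matches the XML 1.0 "Char"
// production. Code point 0 (NULL) is not in any of the permitted ranges.
inline bool is_xml10_char(std::uint32_t cp) {
  return cp == 0x9 || cp == 0xA || cp == 0xD ||
         (cp >= 0x20    && cp <= 0xD7FF)  ||
         (cp >= 0xE000  && cp <= 0xFFFD)  ||
         (cp >= 0x10000 && cp <= 0x10FFFF);
}
```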
- 11299. By Juan Zacarias
-
Modified validation of utf8 in the unparsed-text* functions to detect invalid xml.
- 11300. By Juan Zacarias
-
Merged with trunk.
- 11301. By Juan Zacarias
-
Fixed Wrong Error message error.
Chris Hillery (ceejatec) wrote:
Sorry, but this implementation isn't right. You're still sucking the entire contents of the istream into memory via that stringstream. That's not acceptable. (You also have a potential memory leak since you don't use an auto_ptr<> for the stringstream you allocate with new, but that will be irrelevant since there shouldn't be a stringstream at all.)
I believe this implementation will also throw an error if there aren't at least 3 bytes in the input stream.
Please re-read my comment from 2013-04-24. The best solution would be to implement a buffer::attach() iostreams class with a 3-byte buffer. This would allow you to safely read and, if necessary, put back 3 bytes to check for a BOM. You could then attach THAT to either unparsed::attach() or transcode::attach() as you do here. I kind of feel like there must be an open-source class that does that already; it's completely generic.
Failing that, the closest thing to a right answer would be to revert to the code you had before using istream::unget(), with an additional check that the stream is seekable before attempting it. It's half a solution, but it's better than no solution.
Chris Hillery (ceejatec) wrote:
Rejecting this proposal as it stands. If we want to fix this bug in future, we need to revisit the plan.
Unmerged revisions
- 11303. By sorin.marian.nasoi <email address hidden>
-
- merge lp:zorba trunk.
- 11302. By sorin.marian.nasoi <email address hidden>
-
- merged lp:zorba trunk after fixing the conflicts in test/fots/CMakeLists.txt
- 11301. By Juan Zacarias
-
Fixed Wrong Error message error.
- 11300. By Juan Zacarias
-
Merged with trunk.
- 11299. By Juan Zacarias
-
Modified validation of utf8 in the unparsed-text* functions to detect invalid xml.
- 11298. By Juan Zacarias
-
Changes to fn-unparsed-text-available.
- 11297. By Juan Zacarias
-
Fixed warnings.
- 11296. By Juan Zacarias
-
Fixed throw error for unparsed-text* functions.
- 11295. By Juan Zacarias
-
Updated expected failure list for unparsed-text* fots tests.
- 11294. By Juan Zacarias
-
Fixed some unparsed-text wrong errors.
Preview Diff
1 | === modified file 'src/runtime/CMakeLists.txt' |
2 | --- src/runtime/CMakeLists.txt 2013-06-15 02:57:08 +0000 |
3 | +++ src/runtime/CMakeLists.txt 2013-08-13 06:08:34 +0000 |
4 | @@ -135,6 +135,7 @@ |
5 | numerics/format_integer.cpp |
6 | numerics/format_number.cpp |
7 | sequences/SequencesImpl.cpp |
8 | + sequences/unparsed_streambuf.cpp |
9 | visitors/iterprinter.cpp |
10 | update/update.cpp |
11 | util/item_iterator.cpp |
12 | |
13 | === modified file 'src/runtime/sequences/sequences_impl.cpp' |
14 | --- src/runtime/sequences/sequences_impl.cpp 2013-07-12 14:15:52 +0000 |
15 | +++ src/runtime/sequences/sequences_impl.cpp 2013-08-13 06:08:34 +0000 |
16 | @@ -61,6 +61,9 @@ |
17 | #include "zorbautils/hashset_node_itemh.h" |
18 | #include "zorbautils/hashset_atomic_itemh.h" |
19 | |
20 | +#include <runtime/sequences/unparsed_streambuf.h> |
21 | +#include <zorba/internal/proxy.h> |
22 | + |
23 | namespace zorbatm = zorba::time; |
24 | |
25 | using namespace std; |
26 | @@ -2135,8 +2138,14 @@ |
27 | { |
28 | //Normalize input to handle filesystem paths, etc. |
29 | zstring lNormUri; |
30 | - normalizeInputUri(aUri, aSctx, loc, &lNormUri); |
31 | - |
32 | + try |
33 | + { |
34 | + normalizeInputUri(aUri, aSctx, loc, &lNormUri); |
35 | + } |
36 | + catch (...) |
37 | + { |
38 | + throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(aUri), ERROR_LOC(loc)); |
39 | + } |
40 | //Check for a fragment identifier |
41 | //Create a zorba::URI for validating if it contains a fragment |
42 | std::auto_ptr<zorba::URI> lUri(new zorba::URI(lNormUri)); |
43 | @@ -2144,12 +2153,17 @@ |
44 | { |
45 | throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(aUri), ERROR_LOC(loc)); |
46 | } |
47 | + |
48 | + zstring lEncoding = aEncoding; |
49 | + if (!transcode::is_supported(lEncoding.c_str())) |
50 | + { |
51 | + throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(aUri), ERROR_LOC(loc)); |
52 | + } |
53 | |
54 | //Resolve URI to stream |
55 | zstring lErrorMessage; |
56 | std::auto_ptr<internal::Resource> lResource = aSctx->resolve_uri |
57 | (lNormUri, internal::EntityData::SOME_CONTENT, lErrorMessage); |
58 | - |
59 | internal::StreamResource* lStreamResource = |
60 | dynamic_cast<internal::StreamResource*>(lResource.get()); |
61 | |
62 | @@ -2159,18 +2173,51 @@ |
63 | } |
64 | StreamReleaser lStreamReleaser = lStreamResource->getStreamReleaser(); |
65 | std::unique_ptr<std::istream, StreamReleaser> lStream(lStreamResource->getStream(), lStreamReleaser); |
66 | - |
67 | lStreamResource->setStreamReleaser(nullptr); |
68 | |
69 | + char lBOM[3]; |
70 | + char lUTF8BOM[] = { 239, 187, 191 }; |
71 | + char lUTF16BOMBE[] = { 254, 255 }; |
72 | + char lUTF16BOMLE[] = { 255, 254 }; |
73 | + zstring lEncoding = aEncoding; |
74 | + int lBufMark = 0; |
75 | + lStream->read(lBOM, 3); |
76 | + if ( lUTF8BOM[0] == lBOM[0] || |
77 | + lUTF8BOM[1] == lBOM[1] || |
78 | + lUTF8BOM[2] == lBOM[2]) |
79 | + { |
80 | + lEncoding = "UTF-8"; |
81 | + lBufMark = 3; |
82 | + //unparsed::attach(*lStream.get(), 0); |
83 | + } |
84 | + else if ( (lUTF16BOMBE[0] == lBOM[0] && lUTF16BOMBE[1] == lBOM[1]) || |
85 | + (lUTF16BOMLE[0] == lBOM[0] && lUTF16BOMLE[1] == lBOM[1])) |
86 | + { |
87 | + lEncoding = "UTF-16"; |
88 | + std::stringstream* stream = new stringstream(); |
89 | + stream->write(lBOM, 3); |
90 | + *stream << lStream->rdbuf(); |
91 | + lStream->rdbuf(stream->rdbuf()); |
92 | + } |
93 | + else |
94 | + { |
95 | + std::stringstream* stream = new stringstream(); |
96 | + stream->write(lBOM, 3); |
97 | + *stream << lStream->rdbuf(); |
98 | + lStream->rdbuf(stream->rdbuf()); |
99 | + } |
100 | + |
101 | //check if encoding is needed |
102 | - if (transcode::is_necessary(aEncoding.c_str())) |
103 | - { |
104 | - if (!transcode::is_supported(aEncoding.c_str())) |
105 | - { |
106 | - throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(aUri), ERROR_LOC(loc)); |
107 | - } |
108 | - transcode::attach(*lStream.get(), aEncoding.c_str()); |
109 | - } |
110 | + |
111 | + if (transcode::is_necessary(lEncoding.c_str())) |
112 | + { |
113 | + transcode::attach(*lStream.get(), lEncoding.c_str()); |
114 | + } |
115 | + else |
116 | + { |
117 | + unparsed::attach(*lStream.get(), lBufMark, aUri, loc); |
118 | + } |
119 | + |
120 | //creates stream item |
121 | GENV_ITEMFACTORY->createStreamableString( |
122 | oResult, |
123 | @@ -2224,6 +2271,7 @@ |
124 | store::Item_t encodingItem; |
125 | zstring uriString; |
126 | zstring encodingString("UTF-8"); |
127 | + zstring lSctxUri; |
128 | |
129 | PlanIteratorState* state; |
130 | DEFAULT_STACK_INIT(PlanIteratorState, state, planState); |
131 | @@ -2241,6 +2289,10 @@ |
132 | |
133 | uriItem->getStringValue2(uriString); |
134 | |
135 | + lSctxUri = theSctx->get_base_uri(); |
136 | + if (lSctxUri == "" || lSctxUri == "file:///#UNDEFINED") |
137 | + throw XQUERY_EXCEPTION(err::XPST0001, ERROR_PARAMS(uriString), ERROR_LOC(loc)); |
138 | + |
139 | try |
140 | { |
141 | readDocument(uriString, encodingString, theSctx, planState, loc, unparsedText); |
142 | @@ -2258,6 +2310,87 @@ |
143 | /******************************************************************************* |
144 | 14.8.6 fn:unparsed-text-lines |
145 | ********************************************************************************/ |
146 | +template<typename CharType,class TraitsType,class Rep> |
147 | +std::basic_istream<CharType,TraitsType>& |
148 | +getline_no_endlines( std::basic_istream<CharType,TraitsType> &is, rstring<Rep> &s) { |
149 | + typedef std::basic_istream<CharType,TraitsType> istream_type; |
150 | + typedef typename istream_type::int_type int_type; |
151 | + typedef std::basic_streambuf<CharType,TraitsType> streambuf_type; |
152 | + typedef rstring<Rep> string_type; |
153 | + typedef typename string_type::size_type size_type; |
154 | + |
155 | + std::ios_base::iostate err = std::ios_base::iostate( std::ios_base::goodbit ); |
156 | + size_type extracted = 0; |
157 | + int_type const idelim1 = TraitsType::to_int_type( '\r' ); |
158 | + int_type const idelim2 = TraitsType::to_int_type( '\n' ); |
159 | + int_type const eof = TraitsType::eof(); |
160 | + std::string check =""; |
161 | + s.clear(); |
162 | + try { |
163 | + streambuf_type *const sb = is.rdbuf(); |
164 | + int_type c = sb->sgetc(); |
165 | + |
166 | + while ( !TraitsType::eq_int_type( c, eof ) && |
167 | + ( !TraitsType::eq_int_type( c, idelim1 ) && |
168 | + !TraitsType::eq_int_type( c, idelim2 ) ) ) { |
169 | + s += TraitsType::to_char_type( c ); |
170 | + check += TraitsType::to_char_type( c ); |
171 | + ++extracted; |
172 | + c = sb->snextc(); |
173 | + } |
174 | + if ( TraitsType::eq_int_type( c, eof ) ) |
175 | + err |= std::ios_base::eofbit; |
176 | + else if ( TraitsType::eq_int_type (c, idelim1) ) { |
177 | + ++extracted; |
178 | + sb->sbumpc(); |
179 | + c = sb->sgetc(); |
180 | + if (!c) |
181 | + { |
182 | + ++extracted; |
183 | + sb->sbumpc(); |
184 | + c = sb->sgetc(); |
185 | + } |
186 | + if ( TraitsType::eq_int_type( c, eof )) |
187 | + { |
188 | + err |= std::ios_base::eofbit; |
189 | + } |
190 | + if ( TraitsType::eq_int_type( c, idelim2 ) ) { |
191 | + ++extracted; |
192 | + sb->sbumpc(); |
193 | + c = sb->sgetc(); |
194 | + if (!c) |
195 | + { |
196 | + ++extracted; |
197 | + sb->sbumpc(); |
198 | + c = sb->sgetc(); |
199 | + } |
200 | + if ( TraitsType::eq_int_type( c, eof )) |
201 | + { |
202 | + err |= std::ios_base::eofbit; |
203 | + } |
204 | + } |
205 | + } |
206 | + else if ( TraitsType::eq_int_type( c, idelim2 ) ) { |
207 | + ++extracted; |
208 | + sb->sbumpc(); |
209 | + c = sb->sgetc(); |
210 | + if ( TraitsType::eq_int_type( c, eof )) |
211 | + { |
212 | + err |= std::ios_base::eofbit; |
213 | + } |
214 | + } else |
215 | + err |= std::ios_base::failbit; |
216 | + } |
217 | + catch ( ... ) { |
218 | + is.setstate( std::ios_base::badbit ); |
219 | + } |
220 | + if ( !extracted ) |
221 | + err |= std::ios_base::failbit; |
222 | + if ( err ) |
223 | + is.setstate( err ); |
224 | + return is; |
225 | +} |
226 | + |
227 | FnUnparsedTextLinesIteratorState::~FnUnparsedTextLinesIteratorState() |
228 | { |
229 | delete theStream; |
230 | @@ -2278,7 +2411,12 @@ |
231 | std::auto_ptr<internal::Resource> lResource; |
232 | StreamReleaser lStreamReleaser; |
233 | std::auto_ptr<zorba::URI> lUri; |
234 | - |
235 | + char lBOM[3]; |
236 | + char lUTF8BOM[] = { 239, 187, 191 }; |
237 | + char lUTF16BOMBE[] = { 254, 255 }; |
238 | + char lUTF16BOMLE[] = { 255, 254 }; |
239 | + int lBufMark(0); |
240 | + |
241 | FnUnparsedTextLinesIteratorState* state; |
242 | DEFAULT_STACK_INIT(FnUnparsedTextLinesIteratorState, state, planState); |
243 | |
244 | @@ -2295,7 +2433,14 @@ |
245 | |
246 | //Normalize input to handle filesystem paths, etc. |
247 | uriItem->getStringValue2(uriString); |
248 | - normalizeInputUri(uriString, theSctx, loc, &lNormUri); |
249 | + try |
250 | + { |
251 | + normalizeInputUri(uriString, theSctx, loc, &lNormUri); |
252 | + } |
253 | + catch (...) |
254 | + { |
255 | + throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(uriString), ERROR_LOC(loc)); |
256 | + } |
257 | |
258 | //Check for a fragment identifier |
259 | //Create a zorba::URI for validating if it contains a fragment |
260 | @@ -2308,7 +2453,7 @@ |
261 | //Resolve URI to stream |
262 | lResource = theSctx->resolve_uri |
263 | (lNormUri, internal::EntityData::SOME_CONTENT, lErrorMessage); |
264 | - |
265 | + |
266 | state->theStreamResource = |
267 | dynamic_cast<internal::StreamResource*>(lResource.get()); |
268 | |
269 | @@ -2319,7 +2464,33 @@ |
270 | state->theStream = new std::unique_ptr<std::istream, StreamReleaser> (state->theStreamResource->getStream(), lStreamReleaser); |
271 | state->theStreamResource->setStreamReleaser(nullptr); |
272 | |
273 | - //check if encoding is needed |
274 | + //Check for bom utf-8 and remove the bom definition |
275 | + state->theStream->get()->read(lBOM, 3); |
276 | + if ( lUTF8BOM[0] == lBOM[0] || |
277 | + lUTF8BOM[1] == lBOM[1] || |
278 | + lUTF8BOM[2] == lBOM[2]) |
279 | + { |
280 | + encodingString = "UTF-8"; |
281 | + lBufMark = 3; |
282 | + //unparsed::attach(*state->theStream->get(), 3); |
283 | + } |
284 | + else if ( (lUTF16BOMBE[0] == lBOM[0] && lUTF16BOMBE[1] == lBOM[1]) || |
285 | + (lUTF16BOMLE[0] == lBOM[0] && lUTF16BOMLE[1] == lBOM[1])) |
286 | + { |
287 | + encodingString = "UTF-16"; |
288 | + std::stringstream* stream = new stringstream(); |
289 | + stream->write(lBOM, 3); |
290 | + *stream << state->theStream->get()->rdbuf(); |
291 | + state->theStream->get()->rdbuf(stream->rdbuf()); |
292 | + } |
293 | + else |
294 | + { |
295 | + std::stringstream* stream = new stringstream(); |
296 | + stream->write(lBOM, 3); |
297 | + *stream << state->theStream->get()->rdbuf(); |
298 | + state->theStream->get()->rdbuf(stream->rdbuf()); |
299 | + } |
300 | + |
301 | if (transcode::is_necessary(encodingString.c_str())) |
302 | { |
303 | if (!transcode::is_supported(encodingString.c_str())) |
304 | @@ -2328,10 +2499,14 @@ |
305 | } |
306 | transcode::attach(*state->theStream->get(), encodingString.c_str()); |
307 | } |
308 | + else |
309 | + { |
310 | + unparsed::attach(*state->theStream->get(), lBufMark, uriString, loc); |
311 | + } |
312 | |
313 | while (state->theStream->get()->good()) |
314 | { |
315 | - getline(*state->theStream->get(), streamLine); |
316 | + getline_no_endlines(*state->theStream->get(), streamLine); |
317 | STACK_PUSH(GENV_ITEMFACTORY->createString(result, streamLine), state); |
318 | } |
319 | |
320 | |
321 | === added file 'src/runtime/sequences/unparsed_streambuf.cpp' |
322 | --- src/runtime/sequences/unparsed_streambuf.cpp 1970-01-01 00:00:00 +0000 |
323 | +++ src/runtime/sequences/unparsed_streambuf.cpp 2013-08-13 06:08:34 +0000 |
324 | @@ -0,0 +1,226 @@ |
325 | +#include "unparsed_streambuf.h" |
326 | + |
327 | +#include "diagnostics/xquery_diagnostics.h" |
328 | +#include "diagnostics/util_macros.h" |
329 | +#include "util/oseparator.h" |
330 | + |
331 | +#include <iomanip> |
332 | + |
333 | +using namespace std; |
334 | + |
335 | +namespace zorba { |
336 | +namespace unparsed{ |
337 | + namespace xml{ |
338 | + streambuf::pos_type streambuf::seekoff( off_type o, ios_base::seekdir d, ios_base::openmode m ) |
339 | + { |
340 | + clear(); |
341 | + return original()->pubseekoff( o, d, m ); |
342 | + } |
343 | + |
344 | + streambuf::pos_type streambuf::seekpos( pos_type p, ios_base::openmode m ) |
345 | + { |
346 | + clear(); |
347 | + return original()->pubseekpos( p, m ); |
348 | + } |
349 | + |
350 | + streambuf::int_type streambuf::pbackfail( int_type c ) |
351 | + { |
352 | + if ( !traits_type::eq_int_type( c, traits_type::eof() ) && |
353 | + gbuf_.cur_len_ && |
354 | + original()->sputbackc( traits_type::to_char_type( c ) ) ) { |
355 | + --gbuf_.cur_len_; |
356 | + return c; |
357 | + } |
358 | + return traits_type::eof(); |
359 | + } |
360 | + |
361 | + streambuf::int_type streambuf::uflow() |
362 | + { |
363 | + #ifdef ZORBA_DEBUG_UTF8_STREAMBUF |
364 | + printf( "uflow()\n" ); |
365 | + #endif |
366 | + int_type const c = original()->sbumpc(); |
367 | + if ( traits_type::eq_int_type( c, traits_type::eof() ) ) |
368 | + return traits_type::eof(); |
369 | + gbuf_.validate( traits_type::to_char_type( c ) ); |
370 | + return c; |
371 | + } |
372 | + |
373 | + inline void streambuf::clear() { |
374 | + gbuf_.clear(); |
375 | + } |
376 | + |
377 | + streamsize streambuf::xsgetn(char_type* to, std::streamsize size ) |
378 | + { |
379 | + #ifdef ZORBA_DEBUG_UTF8_STREAMBUF |
380 | + printf( "xsgetn()\n" ); |
381 | + #endif |
382 | + streamsize return_size = 0; |
383 | + |
384 | + if ( gbuf_.char_len_ ) { |
385 | + streamsize const want = gbuf_.char_len_ - gbuf_.cur_len_; |
386 | + streamsize const get = min( want, size ); |
387 | + streamsize const got = original()->sgetn( to, get ); |
388 | + for ( streamsize i = 0; i < got; ++i ) |
389 | + gbuf_.validate( to[i] ); |
390 | + to += got; |
391 | + size -= got, return_size += got; |
392 | + } |
393 | + |
394 | + while ( size > 0 ) { |
395 | + if ( streamsize const got = original()->sgetn( to, size ) ) { |
396 | + for ( streamsize i = 0; i < got; ++i ) |
397 | + gbuf_.validate( to[i] ); |
398 | + to += got; |
399 | + size -= got, return_size += got; |
400 | + } else |
401 | + break; |
402 | + } |
403 | + return return_size; |
404 | + } |
405 | + |
406 | + inline void streambuf::buf_type::clear() |
407 | + { |
408 | + char_len_ = 0; |
409 | + } |
410 | + |
411 | + void streambuf::buf_type::throw_invalid_utf8( utf8::storage_type *buf, utf8::size_type len ) { |
412 | + ostringstream oss; |
413 | + oss << hex << setfill('0') << setw(2) << uppercase; |
414 | + oseparator comma( ',' ); |
415 | + |
416 | + for ( utf8::size_type i = 0; i < len; ++i ) |
417 | + oss << comma << "0x" << (static_cast<unsigned>( buf[i] ) & 0xFF); |
418 | + |
419 | + clear(); |
420 | + throw ZORBA_EXCEPTION( |
421 | + zerr::ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE, |
422 | + ERROR_PARAMS( oss.str() ) |
423 | + ); |
424 | + } |
425 | + |
426 | + void streambuf::buf_type::validate( utf8::storage_type c, bool bump ) { |
427 | + utf8::size_type char_len_copy = char_len_, cur_len_copy = cur_len_; |
428 | + |
429 | + if ( !char_len_copy ) { |
430 | + // |
431 | + // This means we're (hopefully) at the first byte of a UTF-8 byte sequence |
432 | + // comprising a character. |
433 | + // |
434 | + try { |
435 | + char_len_copy = utf8::char_length( c ); |
436 | + cur_len_copy = 0; |
437 | + if (!c) |
438 | + throw_invalid_utf8 ( &c, 1); |
439 | + } |
440 | + catch ( utf8::invalid_byte const& ) { |
441 | + throw_invalid_utf8( &c, 1 ); |
442 | + } |
443 | + } |
444 | + |
445 | + utf8::storage_type *const cur_byte_ptr = utf8_char_ + cur_len_copy; |
446 | + utf8::storage_type const old_byte = *cur_byte_ptr; |
447 | + *cur_byte_ptr = c; |
448 | + |
449 | + if ( cur_len_copy++ && !utf8::is_continuation_byte( c ) ) |
450 | + throw_invalid_utf8( utf8_char_, cur_len_copy ); |
451 | + |
452 | + if ( bump ) { |
453 | + char_len_ = (cur_len_copy == char_len_copy ? 0 : char_len_copy); |
454 | + cur_len_ = cur_len_copy; |
455 | + } else { |
456 | + *cur_byte_ptr = old_byte; |
457 | + } |
458 | + } |
459 | + } //xml namespace |
460 | + |
461 | + streambuf::streambuf(std::streambuf* orig, zstring const& uri, QueryLoc const& loc) : |
462 | + proxy_buf(new xml::streambuf(orig)), |
463 | + i_uri(uri), |
464 | + i_loc(loc), |
465 | + i_mark(0) |
466 | + { |
467 | + } |
468 | + |
469 | + streambuf::streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc) : |
470 | + proxy_buf(new xml::streambuf(orig)), |
471 | + i_uri(uri), |
472 | + i_loc(loc), |
473 | + i_mark(mark) |
474 | + { |
475 | + } |
476 | + |
477 | + streambuf::~streambuf() {} |
478 | + |
479 | + void streambuf::imbue( std::locale const &loc) |
480 | + { |
481 | + proxy_buf->pubimbue( loc ); |
482 | + } |
483 | + |
484 | + streambuf::pos_type streambuf::seekoff( off_type o, ios_base::seekdir d, ios_base::openmode m ) |
485 | + { |
486 | + return proxy_buf->pubseekoff( o + i_mark, d, m); |
487 | + } |
488 | + |
489 | + streambuf::pos_type streambuf::seekpos( pos_type p, ios_base::openmode m ) { |
490 | + return proxy_buf->pubseekpos( p, m ); |
491 | + } |
492 | + |
493 | + std::streambuf* streambuf::setbuf( char_type *p, streamsize s ) { |
494 | + proxy_buf->pubsetbuf( p, s ); |
495 | + return this; |
496 | + } |
497 | + |
498 | + streamsize streambuf::showmanyc() { |
499 | + return proxy_buf->in_avail(); |
500 | + } |
501 | + |
502 | + int streambuf::sync() { |
503 | + return proxy_buf->pubsync(); |
504 | + } |
505 | + |
506 | + streambuf::int_type streambuf::overflow( int_type c ) { |
507 | + return proxy_buf->sputc( c ); |
508 | + } |
509 | + |
510 | + streambuf::int_type streambuf::pbackfail( int_type c ) { |
511 | + return traits_type::eq_int_type( c, traits_type::eof() ) ? |
512 | + c : proxy_buf->sputbackc( traits_type::to_char_type( c ) ); |
513 | + } |
514 | + |
515 | + streambuf::int_type streambuf::uflow() { |
516 | + return proxy_buf->sbumpc(); |
517 | + } |
518 | + |
519 | + streambuf::int_type streambuf::underflow() { |
520 | + return proxy_buf->sgetc(); |
521 | + } |
522 | + |
523 | + streamsize streambuf::xsgetn( char_type *to, streamsize size ) { |
524 | + streamsize res; |
525 | + try |
526 | + { |
527 | + res = proxy_buf->sgetn( to, size ); |
528 | + } |
529 | + catch (ZorbaException const& e) |
530 | + { |
531 | + if (e.diagnostic() == zerr::ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE) |
532 | + throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(i_uri.c_str()), ERROR_LOC(i_loc)); |
533 | + else throw; |
534 | + } |
535 | + return res; |
536 | + } |
537 | + |
538 | + streamsize streambuf::xsputn( char_type const *from, streamsize size ) { |
539 | + return proxy_buf->sputn( from, size ); |
540 | + } |
541 | + |
542 | + /*********************************************************************/ |
543 | + /*********************************************************************/ |
544 | + std::streambuf* alloc_streambuf(std::streambuf *orig, int mark, zstring const& uri, QueryLoc const& loc) |
545 | + { |
546 | + return new zorba::unparsed::streambuf(orig, mark, uri, loc); |
547 | + } |
548 | + |
549 | +}//namesapce unparsed |
550 | +}//namesapce zorba |
551 | \ No newline at end of file |
552 | |
553 | === added file 'src/runtime/sequences/unparsed_streambuf.h' |
554 | --- src/runtime/sequences/unparsed_streambuf.h 1970-01-01 00:00:00 +0000 |
555 | +++ src/runtime/sequences/unparsed_streambuf.h 2013-08-13 06:08:34 +0000 |
556 | @@ -0,0 +1,131 @@ |
557 | +#ifndef ZORBA_UNPARSED_STREAM_H |
558 | +#define ZORBA_UNPARSED_STREAM_H |
559 | + |
560 | +#include <zorba/internal/streambuf.h> |
561 | +#include <zorba/internal/unique_ptr.h> |
562 | +#include "common/shared_types.h" |
563 | +#include "diagnostics/xquery_diagnostics.h" |
564 | +#include "diagnostics/util_macros.h" |
565 | +#include "util/utf8_streambuf.h" |
566 | + |
567 | + |
568 | +namespace zorba{ |
569 | +namespace unparsed{ |
570 | + namespace xml{ |
571 | + //Streambuf class for validating valid characters on xml 1.0 and xml 1.1 |
572 | + class streambuf : public utf8::streambuf |
573 | + { |
574 | + public: |
575 | + streambuf(std::streambuf* orig) : |
576 | + utf8::streambuf( orig, false ){ clear(); } |
577 | + |
578 | + private: |
579 | + struct buf_type |
580 | + { |
581 | + |
582 | + utf8::encoded_char_type utf8_char_; |
583 | + utf8::size_type char_len_; |
584 | + utf8::size_type cur_len_; |
585 | + |
586 | + void clear(); |
587 | + void throw_invalid_utf8( utf8::storage_type *buf, utf8::size_type len ); |
588 | + void validate( utf8::storage_type, bool bump = true ); |
589 | + }; |
590 | + |
591 | + buf_type gbuf_; |
592 | + |
593 | + protected: |
594 | + void clear(); |
595 | + std::streamsize xsgetn(char_type*, std::streamsize); |
596 | + pos_type seekoff( off_type, std::ios_base::seekdir, std::ios_base::openmode ); |
597 | + pos_type seekpos( pos_type, std::ios_base::openmode ); |
598 | + int_type pbackfail( int_type ); |
599 | + int_type uflow(); |
600 | + }; |
601 | + |
602 | + }//namespace xml |
603 | + |
604 | + class streambuf :public std::streambuf { |
605 | + public: |
606 | + streambuf(std::streambuf* orig, zstring const& uri, QueryLoc const& loc); |
607 | + streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc); |
608 | + ~streambuf(); |
609 | + |
610 | + std::streambuf* orig_streambuf() const { |
611 | + return proxy_buf->original(); |
612 | + } |
613 | + |
614 | + protected: |
615 | + void imbue( std::locale const& ); |
616 | + pos_type seekoff( off_type, std::ios_base::seekdir, std::ios_base::openmode ); |
617 | + pos_type seekpos( pos_type, std::ios_base::openmode ); |
618 | + std::streambuf* setbuf( char_type*, std::streamsize ); |
619 | + std::streamsize showmanyc(); |
620 | + int sync(); |
621 | + int_type overflow( int_type ); |
622 | + int_type pbackfail( int_type ); |
623 | + int_type uflow(); |
624 | + int_type underflow(); |
625 | + std::streamsize xsgetn( char_type*, std::streamsize ); |
626 | + std::streamsize xsputn( char_type const*, std::streamsize ); |
627 | + |
628 | + private: |
629 | + std::unique_ptr<internal::proxy_streambuf> proxy_buf; |
630 | + zstring i_uri; |
631 | + QueryLoc i_loc; |
632 | + int i_mark; |
633 | + |
634 | + streambuf( streambuf const&); |
635 | + streambuf& operator=( streambuf const&); |
636 | + }; |
637 | + |
638 | + |
639 | + std::streambuf* alloc_streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc); |
640 | + |
641 | + template<typename charT, class Traits> inline |
642 | + void attach( std::basic_ios<charT,Traits> &ios, int mark, zstring const& uri, QueryLoc const& loc) |
643 | + { |
644 | + int const index = std::ios_base::xalloc(); |
645 | + void *&pword = ios.pword( index ); |
646 | + if ( !pword ) { |
647 | + std::streambuf *const buf = |
648 | + alloc_streambuf( ios.rdbuf(), mark, uri, loc ); |
649 | + ios.rdbuf( buf ); |
650 | + pword = buf; |
651 | + ios.register_callback( internal::stream_callback, index ); |
652 | + } |
653 | + } |
654 | + |
655 | + template<typename charT, class Traits> inline |
656 | + void detach( std::basic_ios<charT,Traits> &ios ) |
657 | + { |
658 | + int const index = std::ios_base::xalloc(); |
659 | + if ( streambuf* const buf = static_cast<streambuf*>( ios.pword( index ) ) ) |
660 | + { |
661 | + ios.pword( index ) = 0; |
662 | + ios.rdbuf( buf->orig_streambuf() ); |
663 | + delete buf; |
664 | + } |
665 | + } |
666 | + |
667 | + template<class StreamType> |
668 | + class auto_attach { |
669 | + public: |
670 | + auto_attach(StreamType &stream, int mark, zstring const& uri, QueryLoc const& loc) : stream_( stream ) |
671 | + { |
672 | + attach(stream, mark, uri, loc); |
673 | + } |
674 | + |
675 | + ~auto_attach() |
676 | + { |
677 | + detach( stream_ ); |
678 | + } |
679 | + |
680 | + private: |
681 | + StreamType &stream_; |
682 | + }; |
683 | + |
684 | +}//namespace unparsed |
685 | +}//namesapce zorba |
686 | + |
687 | +#endif |
This branch does not fix all of the errors yet. Below is the list of errors that still lack a solution, each with a brief description of the current problem.
The remaining errors are caused by 3 problems:
* utf-8 encoding: we are missing the stream, suggested to Paul, that can report errors when invalid utf-8 is found.
* unknown encoding: the function should be able to identify when the text has an unknown encoding. This error cannot be addressed at the moment, but it may become possible once the request for the previous issue is available.
* unparsed-text utf-8 bom issue: I sent an email proposing 2 solutions for this problem, though probably not very pretty ones. The discussion is named "utf-8 byte order marks"; see the email reply chain for more information on this issue.
Apart from this list, a number of other errors were solved. I suggest we merge these fixes but keep the branch open until a utf-8 stream that handles invalid characters is implemented or a new suggestion is made.
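For the first problem in the list above, the check that such a stream would have to perform can be sketched as a plain function over a byte string. This is a standalone illustration, not the stream requested from Paul and not Zorba's `utf8::streambuf`; the name `is_valid_utf8` is an assumption. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if s is a well-formed UTF-8 byte sequence.
inline bool is_valid_utf8(const std::string& s) {
  // Smallest code point that legitimately needs a sequence of each length.
  static const unsigned long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };

  std::size_t i = 0;
  std::size_t const n = s.size();
  while (i < n) {
    unsigned char const b0 = static_cast<unsigned char>(s[i]);
    std::size_t len;
    unsigned long cp;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return false;            // stray continuation or invalid lead byte

    if (i + len > n)
      return false;               // sequence truncated at end of input

    for (std::size_t k = 1; k < len; ++k) {
      unsigned char const b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80)
        return false;             // expected a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }

    if (cp < min_cp[len])             return false;  // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF)                return false;  // beyond Unicode range
    i += len;
  }
  return true;
}
```

A streaming version would keep the partially decoded character between reads instead of requiring the whole input in one string.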
UNPARSED-TEXT-LINES
<fots:test-case name="fn-unparsed-text-lines-037" result="fail"/>
Expected Error FOUT1200: unknown encoding
<fots:test-case name="fn-unparsed-text-lines-038" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-039" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-042" result="fail"/>
uses unparsed-text not unparsed-text-lines (utf-8 bom issue)
UNPARSED-TEXT
<fots:test-case name="fn-unparsed-text-037" result="fail"/>
Expected Error FOUT1200: unknown encoding
<fots:test-case name="fn-unparsed-text-038" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-039" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-042" result="fail"/>
utf-8 bom problem
<fots:test-case name="fn-unparsed-text-048" result="wrongError"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-049" result="fail"/>
counts one more (can be utf-8 bom)
UNPARSED-TEXT-AVAILABLE
<fots:test-case name="fn-unparsed-text-available-036" result="fail"/>
unknown encoding
<fots:test-case name="fn-unparsed-text-available-037" result="fail"/>
utf-8 validation
<fots:test-case name="fn-unparsed-text-available-038" result="fail"/>
utf-8 validation