Merge into trunk : bug1123835 : Code : Zorba

Reviewer	Review Type	Date Requested	Status
Chris Hillery		2013-04-15	Needs Fixing on 2013-06-07
Review via email: mp+158978@code.launchpad.net

Revision history for this message

Juan Zacarias (juan457) wrote on 2013-04-15:

#

This branch doesn't solve all the errors yet this is the list of errors missing solution with a brief description of the current problem.

The missing errors are caused by 3 problems
* utf-8 encoding: missing the stream suggested to Paul that can handle errors when invalid utf-8 are found.
* unknown encoding: the function should be able to identify if the text has an unknown encoding, This error is currently not approachable but it may be done if request for the previous issue is available.
* unparsed-text utf-8 bom issue, I sent an email proposing 2 solutions for this problem. But probably not really pretty ones main discussion name is "utf-8 byte order marks" take a look on the email reply chain for more information on this issue.

Regarding this list a bunch of other errors were solved. I suggest we merge this fixes but keep the branch open until utf-8 stream that handles invalid characters is implemented or a new suggestion is made.

UNPARSED-TEXT-LINES
<fots:test-case name="fn-unparsed-text-lines-037" result="fail"/>
Expected Error FOUT1200: unkown encoding
<fots:test-case name="fn-unparsed-text-lines-038" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-039" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-042" result="fail"/>
uses unparsed-text not unparsed-text-lines (utf-8 bom issue)

UNPARSED-TEXT
<fots:test-case name="fn-unparsed-text-037" result="fail"/>
Expected Error FOUT1200: unkown encoding
<fots:test-case name="fn-unparsed-text-038" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-039" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-042" result="fail"/>
utf-8 bom problem
<fots:test-case name="fn-unparsed-text-048" result="wrongError"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-049" result="fail"/>
counts one more (can be utf-8 bom)

UNPARSED-TEXT-AVAILABLE
<fots:test-case name="fn-unparsed-text-available-036" result="fail"/>
unknown encoding
<fots:test-case name="fn-unparsed-text-available-037" result="fail"/>
utf-8 validation
<fots:test-case name="fn-unparsed-text-available-038" result="fail"/>
utf-8 validation

This branch doesn't solve all the errors yet this is the list of errors missing solution with a brief description of the current problem.

The missing errors are caused by 3 problems
* utf-8 encoding: missing the stream suggested to Paul that can handle errors when invalid utf-8 are found.
* unknown encoding: the function should be able to identify if the text has an unknown encoding, This error is currently not approachable but it may be done if request for the previous issue is available.
* unparsed-text utf-8 bom issue, I sent an email proposing 2 solutions for this problem. But probably not really pretty ones main discussion name is "utf-8 byte order marks" take a look on the email reply chain for more information on this issue.

Regarding this list a bunch of other errors were solved. I suggest we merge this fixes but keep the branch open until utf-8 stream that handles invalid characters is implemented or a new suggestion is made.

UNPARSED-TEXT-LINES    
<fots:test-case name="fn-unparsed-text-lines-037" result="fail"/>
Expected Error FOUT1200: unkown encoding
<fots:test-case name="fn-unparsed-text-lines-038" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-039" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-lines-042" result="fail"/>
uses unparsed-text not unparsed-text-lines (utf-8 bom issue)

UNPARSED-TEXT
<fots:test-case name="fn-unparsed-text-037" result="fail"/>
Expected Error FOUT1200: unkown encoding
<fots:test-case name="fn-unparsed-text-038" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-039" result="fail"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-042" result="fail"/>
utf-8 bom problem
<fots:test-case name="fn-unparsed-text-048" result="wrongError"/>
Expected Error 1190: utf-8 validation
<fots:test-case name="fn-unparsed-text-049" result="fail"/>   
counts one more (can be utf-8 bom)

UNPARSED-TEXT-AVAILABLE
<fots:test-case name="fn-unparsed-text-available-036" result="fail"/>
unknown encoding
<fots:test-case name="fn-unparsed-text-available-037" result="fail"/>
utf-8 validation
<fots:test-case name="fn-unparsed-text-available-038" result="fail"/>
utf-8 validation

Revision history for this message

Chris Hillery (ceejatec) wrote on 2013-04-16:

#

1. It looks like you didn't check in some changes? sequences_impl.cpp refers to a method URI::get_encoded_fragment() that doesn't exist.

2. sequences_impl.cpp now shows up on my system as "ISO-8859 English text" rather than "ASCII English text" when I use the "file" command. My editor (QtCreator) refuses to edit this file because of the encoding. This appears to be because you used some non-ASCII characters, specifically when comparing the variable "peek" to various characters. Please use \0xXXXX encodings for these characters so the result is still ASCII, to prevent those characters from being munged accidentally in the future.

3. You have some merge conflicts as well.

I'll review the actual changes more closely when at least (1) is fixed up.

review: Needs Fixing

Revision history for this message

Chris Hillery (ceejatec) wrote on 2013-04-16:

#

Ok, maybe ignore point #1 - that function is there after all. I'm not sure how I missed it; must have been a typo in my search. Anyway, I'll review the code more thoroughly a bit later tonight. Points #2 and #3 still need to be fixed.

lp:~zorba-coders/zorba/bug1123835 updated on 2013-04-23

11284. By sorin.marian.nasoi <email address hidden> on 2013-04-23: - replaced the "ISO-8859" characters with ASCII equivalents
- added big-endian and little-endian for UTF-16
11285. By sorin.marian.nasoi <email address hidden> on 2013-04-23: - merge lp:zorba trunk after conflicts were solved
11286. By sorin.marian.nasoi <email address hidden> on 2013-04-23: - removed big-endian part
- added changes to the EXPECTED_FAILURES

Revision history for this message

Sorin Marian Nasoi (sorin.marian.nasoi) wrote on 2013-04-23:

#

> Ok, maybe ignore point #1 - that function is there after all. I'm not sure how
> I missed it; must have been a typo in my search. Anyway, I'll review the code
> more thoroughly a bit later tonight. Points #2 and #3 still need to be fixed.
Points #2 and #3 should be fixed.

The proposed changes, so far:
- fix 17 'wrongError' test cases (meaning they now raise the correct error and pass)
- fix 5 'fail' test cases (please see changes in 'test/fots/CMakeLists.txt')

Still, there are 15 failures still in fn-unparsed-text* test cases:

- 7 in "fn-unparsed-text-lines":
"fn-unparsed-text-lines-037,fn-unparsed-text-lines-038,fn-unparsed-text-lines-039,fn-unparsed-text-lines-042,fn-unparsed-text-lines-050,fn-unparsed-text-lines-053,fn-unparsed-text-lines-054"

- 3 in "fn-unparsed-text-available":
"fn-unparsed-text-available-036,fn-unparsed-text-available-037,fn-unparsed-text-available-038"

- 5 in "fn-unparsed-text":
"fn-unparsed-text-037,fn-unparsed-text-038,fn-unparsed-text-039,fn-unparsed-text-042,fn-unparsed-text-050"

Revision history for this message

Chris Hillery (ceejatec) wrote on 2013-04-24:

#

Unfortunately, I don't think the proposed change is safe. istream::unget() is not guaranteed to work, and in particular it probably won't work if the stream is coming via HTTP.

However, there is a function StreamResource::isStreamSeekable(). Perhaps you could check that flag, and only call the BOM-checking/unget block of code when it returns true. We could at least claim that we are making a best-effort to validate the incoming data.

I would be much happier with something like a three-byte "wrapper" stream buffer that worked the same way BufferedInputStream does in Java, but it seems like that is somehow really hard to accomplish in C++.

review: Needs Fixing

Revision history for this message

Chris Hillery (ceejatec) wrote on 2013-04-24:

#

Also, we should not have a check in our code for a base URI "#UNDEFINED". That comes from FOTS; it is not part of Zorba or XQuery. I do not know how to tell whether the base URI of a static context is actually "undefined" or not. In fact I'm not even 100% sure what that means.

review: Needs Fixing

lp:~zorba-coders/zorba/bug1123835 updated on 2013-05-27

11287. By Juan Zacarias on 2013-04-29: Implementation of a stream wrapper for unparsed-text Functions.
11288. By Juan Zacarias on 2013-05-06: Fixed implementation o streambuf wrapper for unparsed-text* functions.
11289. By Juan Zacarias on 2013-05-06: Fixed divergion of branch.
11290. By Juan Zacarias on 2013-05-06: Fixes for linux build.
11291. By Juan Zacarias on 2013-05-09: Merged with trunk added, added impl for utf-8 valid characters.
11292. By Juan Zacarias on 2013-05-09: Fixed build for Linux.
11293. By Juan Zacarias on 2013-05-13: Added custom getline for unparsed-text-lines function implementation.
11294. By Juan Zacarias on 2013-05-13: Fixed some unparsed-text wrong errors.
11295. By Juan Zacarias on 2013-05-13: Updated expected failure list for unparsed-text* fots tests.
11296. By Juan Zacarias on 2013-05-14: Fixed throw error for unparsed-text* functions.
11297. By Juan Zacarias on 2013-05-14: Fixed warnings.
11298. By Juan Zacarias on 2013-05-27: Changes to fn-unparsed-text-available.

Revision history for this message

Sorin Marian Nasoi (sorin.marian.nasoi) wrote on 2013-05-31:

#

3 of the failing test-cases are correct (fn-unparsed-text-lines-039, fn-unparsed-text-039, fn-unparsed-text-available-038).

Here is why:
The 3 test-cases use fn/unparsed-text/non-xml-character.txt that contains BOM followed by NULL character.

Here is the catch: NULL is not a valid XML 1.0 nor XML 1.1 character.

The F&O spec mentions as an error condition for all 3 unparsed-text* functions:

"A dynamic error is raised [err:FOUT1190]
[...]
if the resulting characters are not permitted XML characters."

lp:~zorba-coders/zorba/bug1123835 updated on 2013-06-03

11299. By Juan Zacarias on 2013-06-03: Modified validation of utf8 in the unparsed-text* functions to detect invalid xml.
11300. By Juan Zacarias on 2013-06-03: Merged with trunk.
11301. By Juan Zacarias on 2013-06-03: Fixed Wrong Error message error.

Revision history for this message

Chris Hillery (ceejatec) wrote on 2013-06-07:

#

Sorry, but this implementation isn't right. You're still sucking the entire contents of the istream into memory via that stringstream. That's not acceptable. (You also have a potential memory leak since you don't use an auto_ptr<> for the stringstream you allocate on the stack, but that will be irrelevant since there shouldn't be a stringstream at all.)

I believe this implementation will also throw an error if there aren't at least 3 bytes in the input stream.

Please re-read my comment from 2013-04-24. The best solution would be to implement a buffer::attach() iostreams class with a 3-byte buffer. This would allow you to safely read and, if necessary, put back 3 bytes to check for a BOM. You could then attach THAT to either unparsed::attach() or transcode::attach() as you do here. I kind of feel like there must be an open-source class that does that already; it's completely generic.

Failing that, the closest thing to a right answer would be to revert to the code you had before using istream::unget(), with an additional check that the stream is seekable before attempting it. It's half a solution, but it's better than no solution.

review: Needs Fixing

lp:~zorba-coders/zorba/bug1123835 updated on 2013-08-13

11302. By sorin.marian.nasoi <email address hidden> on 2013-08-05: - merged lp:zorba trunk after fixing the conflicts in test/fots/CMakeLists.txt
11303. By sorin.marian.nasoi <email address hidden> on 2013-08-13: - merge lp:zorba trunk.

Revision history for this message

Chris Hillery (ceejatec) wrote on 2013-09-18:

#

Rejecting this proposal as it stands. If we want to fix this bug in future, we need to revisit the plan.

Zorba

Merge lp:~zorba-coders/zorba/bug1123835 into lp:zorba

Commit message

Description of the change

Unmerged revisions

Preview Diff

Subscribers

 === modified file 'src/runtime/CMakeLists.txt'
 --- src/runtime/CMakeLists.txt	2013-06-15 02:57:08 +0000
 +++ src/runtime/CMakeLists.txt	2013-08-13 06:08:34 +0000
@@ -135,6 +135,7 @@
    numerics/format_integer.cpp
    numerics/format_number.cpp
    sequences/SequencesImpl.cpp
++  sequences/unparsed_streambuf.cpp
    visitors/iterprinter.cpp
    update/update.cpp
    util/item_iterator.cpp
 === modified file 'src/runtime/sequences/sequences_impl.cpp'
 --- src/runtime/sequences/sequences_impl.cpp	2013-07-12 14:15:52 +0000
 +++ src/runtime/sequences/sequences_impl.cpp	2013-08-13 06:08:34 +0000
@@ -61,6 +61,9 @@
  #include "zorbautils/hashset_node_itemh.h"
  #include "zorbautils/hashset_atomic_itemh.h"
++#include <runtime/sequences/unparsed_streambuf.h>
++#include <zorba/internal/proxy.h>
++
  namespace zorbatm = zorba::time;
  using namespace std;
@@ -2135,8 +2138,14 @@
+ {
    //Normalize input to handle filesystem paths, etc.
    zstring lNormUri;
--  normalizeInputUri(aUri, aSctx, loc, &lNormUri);
--
++  try
++  {
++    normalizeInputUri(aUri, aSctx, loc, &lNormUri);
++  }
++  catch (...)
++  {
++    throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(aUri), ERROR_LOC(loc));
++  }
    //Check for a fragment identifier
    //Create a zorba::URI for validating if it contains a fragment
    std::auto_ptr<zorba::URI> lUri(new zorba::URI(lNormUri));
@@ -2144,12 +2153,17 @@
+   {
      throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(aUri), ERROR_LOC(loc));
+   }
++
++  zstring lEncoding = aEncoding;
++  if (!transcode::is_supported(lEncoding.c_str()))
++  {
++    throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(aUri), ERROR_LOC(loc));
++  }
    //Resolve URI to stream
    zstring lErrorMessage;
    std::auto_ptr<internal::Resource> lResource = aSctx->resolve_uri
      (lNormUri, internal::EntityData::SOME_CONTENT, lErrorMessage);
--
    internal::StreamResource* lStreamResource =
      dynamic_cast<internal::StreamResource*>(lResource.get());
@@ -2159,18 +2173,51 @@
+   }
    StreamReleaser lStreamReleaser = lStreamResource->getStreamReleaser();
    std::unique_ptr<std::istream, StreamReleaser> lStream(lStreamResource->getStream(), lStreamReleaser);
--
    lStreamResource->setStreamReleaser(nullptr);
++  char lBOM[3];
++  char lUTF8BOM[] = { 239, 187, 191 };
++  char lUTF16BOMBE[] = { 254, 255 };
++  char lUTF16BOMLE[] = { 255, 254 };
++  zstring lEncoding = aEncoding;
++  int lBufMark = 0;
++  lStream->read(lBOM, 3);
++  if ( lUTF8BOM[0] == lBOM[0] ||
++       lUTF8BOM[1] == lBOM[1] ||
++       lUTF8BOM[2] == lBOM[2])
++  {
++    lEncoding = "UTF-8";
++    lBufMark = 3;
++    //unparsed::attach(*lStream.get(), 0);
++  }
++  else if ( (lUTF16BOMBE[0] == lBOM[0] && lUTF16BOMBE[1] == lBOM[1]) ||
++    (lUTF16BOMLE[0] == lBOM[0] && lUTF16BOMLE[1] == lBOM[1]))
++  {
++    lEncoding = "UTF-16";
++    std::stringstream* stream = new stringstream();
++    stream->write(lBOM, 3);
++    *stream << lStream->rdbuf();
++    lStream->rdbuf(stream->rdbuf());
++  }
++  else
++  {
++    std::stringstream* stream = new stringstream();
++    stream->write(lBOM, 3);
++    *stream << lStream->rdbuf();
++    lStream->rdbuf(stream->rdbuf());
++  }
++
    //check if encoding is needed
--  if (transcode::is_necessary(aEncoding.c_str()))
--  {
--    if (!transcode::is_supported(aEncoding.c_str()))
--    {
--      throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(aUri), ERROR_LOC(loc));
--    }
--    transcode::attach(*lStream.get(), aEncoding.c_str());
--  }
++
++ if (transcode::is_necessary(lEncoding.c_str()))
++  {
++    transcode::attach(*lStream.get(), lEncoding.c_str());
++  }
++  else
++  {
++    unparsed::attach(*lStream.get(), lBufMark, aUri, loc);
++  }
++
    //creates stream item
    GENV_ITEMFACTORY->createStreamableString(
      oResult,
@@ -2224,6 +2271,7 @@
    store::Item_t encodingItem;
    zstring uriString;
    zstring encodingString("UTF-8");
++  zstring lSctxUri;
    PlanIteratorState* state;
    DEFAULT_STACK_INIT(PlanIteratorState, state, planState);
@@ -2241,6 +2289,10 @@
    uriItem->getStringValue2(uriString);
++  lSctxUri = theSctx->get_base_uri();
++  if (lSctxUri == "" || lSctxUri == "file:///#UNDEFINED")
++    throw XQUERY_EXCEPTION(err::XPST0001, ERROR_PARAMS(uriString), ERROR_LOC(loc));
++
    try
+   {
      readDocument(uriString, encodingString, theSctx, planState, loc, unparsedText);
@@ -2258,6 +2310,87 @@
  /*******************************************************************************
 .8.6 fn:unparsed-text-lines
  ********************************************************************************/
++template<typename CharType,class TraitsType,class Rep>
++std::basic_istream<CharType,TraitsType>&
++getline_no_endlines( std::basic_istream<CharType,TraitsType> &is, rstring<Rep> &s) {
++  typedef std::basic_istream<CharType,TraitsType> istream_type;
++  typedef typename istream_type::int_type int_type;
++  typedef std::basic_streambuf<CharType,TraitsType> streambuf_type;
++  typedef rstring<Rep> string_type;
++  typedef typename string_type::size_type size_type;
++
++  std::ios_base::iostate err = std::ios_base::iostate( std::ios_base::goodbit );
++  size_type extracted = 0;
++  int_type const idelim1 = TraitsType::to_int_type( '\r' );
++  int_type const idelim2 = TraitsType::to_int_type( '\n' );
++  int_type const eof = TraitsType::eof();
++  std::string check ="";
++  s.clear();
++  try {
++    streambuf_type *const sb = is.rdbuf();
++    int_type c = sb->sgetc();
++
++    while ( !TraitsType::eq_int_type( c, eof ) &&
++            ( !TraitsType::eq_int_type( c, idelim1 ) &&
++              !TraitsType::eq_int_type( c, idelim2 ) ) ) {
++      s += TraitsType::to_char_type( c );
++      check += TraitsType::to_char_type( c );
++      ++extracted;
++      c = sb->snextc();
++    }
++    if ( TraitsType::eq_int_type( c, eof ) )
++      err |= std::ios_base::eofbit;
++    else if ( TraitsType::eq_int_type (c, idelim1) ) {
++      ++extracted;
++      sb->sbumpc();
++      c = sb->sgetc();
++      if (!c)
++      {
++        ++extracted;
++        sb->sbumpc();
++        c = sb->sgetc();
++      }
++      if ( TraitsType::eq_int_type( c, eof ))
++      {
++        err |= std::ios_base::eofbit;
++      }
++      if ( TraitsType::eq_int_type( c, idelim2 ) ) {
++        ++extracted;
++        sb->sbumpc();
++        c = sb->sgetc();
++        if (!c)
++        {
++          ++extracted;
++          sb->sbumpc();
++          c = sb->sgetc();
++        }
++        if ( TraitsType::eq_int_type( c, eof ))
++        {
++          err |= std::ios_base::eofbit;
++        }
++      }
++    }
++    else if ( TraitsType::eq_int_type( c, idelim2 ) ) {
++      ++extracted;
++      sb->sbumpc();
++      c = sb->sgetc();
++      if ( TraitsType::eq_int_type( c, eof ))
++      {
++        err |= std::ios_base::eofbit;
++      }
++    } else
++      err |= std::ios_base::failbit;
++  }
++  catch ( ... ) {
++    is.setstate( std::ios_base::badbit );
++  }
++  if ( !extracted )
++    err |= std::ios_base::failbit;
++  if ( err )
++    is.setstate( err );
++  return is;
++}
++
  FnUnparsedTextLinesIteratorState::~FnUnparsedTextLinesIteratorState()
+ {
    delete theStream;
@@ -2278,7 +2411,12 @@
    std::auto_ptr<internal::Resource> lResource;
    StreamReleaser lStreamReleaser;
    std::auto_ptr<zorba::URI> lUri;
--
++  char lBOM[3];
++  char lUTF8BOM[] = { 239, 187, 191 };
++  char lUTF16BOMBE[] = { 254, 255 };
++  char lUTF16BOMLE[] = { 255, 254 };
++  int lBufMark(0);
++
    FnUnparsedTextLinesIteratorState* state;
    DEFAULT_STACK_INIT(FnUnparsedTextLinesIteratorState, state, planState);
@@ -2295,7 +2433,14 @@
    //Normalize input to handle filesystem paths, etc.
    uriItem->getStringValue2(uriString);
--  normalizeInputUri(uriString, theSctx, loc, &lNormUri);
++  try
++  {
++    normalizeInputUri(uriString, theSctx, loc, &lNormUri);
++  }
++  catch (...)
++  {
++    throw XQUERY_EXCEPTION(err::FOUT1170, ERROR_PARAMS(uriString), ERROR_LOC(loc));
++  }
    //Check for a fragment identifier
    //Create a zorba::URI for validating if it contains a fragment
@@ -2308,7 +2453,7 @@
    //Resolve URI to stream
    lResource = theSctx->resolve_uri
      (lNormUri, internal::EntityData::SOME_CONTENT, lErrorMessage);
--
++
    state->theStreamResource =
      dynamic_cast<internal::StreamResource*>(lResource.get());
@@ -2319,7 +2464,33 @@
    state->theStream = new std::unique_ptr<std::istream, StreamReleaser> (state->theStreamResource->getStream(), lStreamReleaser);
    state->theStreamResource->setStreamReleaser(nullptr);
--  //check if encoding is needed
++  //Check for bom utf-8 and remove the bom definition
++  state->theStream->get()->read(lBOM, 3);
++  if ( lUTF8BOM[0] == lBOM[0] ||
++       lUTF8BOM[1] == lBOM[1] ||
++       lUTF8BOM[2] == lBOM[2])
++  {
++    encodingString = "UTF-8";
++    lBufMark = 3;
++    //unparsed::attach(*state->theStream->get(), 3);
++  }
++  else if ( (lUTF16BOMBE[0] == lBOM[0] && lUTF16BOMBE[1] == lBOM[1]) ||
++    (lUTF16BOMLE[0] == lBOM[0] && lUTF16BOMLE[1] == lBOM[1]))
++  {
++    encodingString = "UTF-16";
++    std::stringstream* stream = new stringstream();
++    stream->write(lBOM, 3);
++    *stream << state->theStream->get()->rdbuf();
++    state->theStream->get()->rdbuf(stream->rdbuf());
++  }
++  else
++  {
++    std::stringstream* stream = new stringstream();
++    stream->write(lBOM, 3);
++    *stream << state->theStream->get()->rdbuf();
++    state->theStream->get()->rdbuf(stream->rdbuf());
++  }
++
    if (transcode::is_necessary(encodingString.c_str()))
+   {
      if (!transcode::is_supported(encodingString.c_str()))
@@ -2328,10 +2499,14 @@
+     }
      transcode::attach(*state->theStream->get(), encodingString.c_str());
+   }
++  else
++  {
++    unparsed::attach(*state->theStream->get(), lBufMark, uriString, loc);
++  }
    while (state->theStream->get()->good())
+   {
--    getline(*state->theStream->get(), streamLine);
++    getline_no_endlines(*state->theStream->get(), streamLine);
      STACK_PUSH(GENV_ITEMFACTORY->createString(result, streamLine), state);
+   }
 === added file 'src/runtime/sequences/unparsed_streambuf.cpp'
 --- src/runtime/sequences/unparsed_streambuf.cpp	1970-01-01 00:00:00 +0000
 +++ src/runtime/sequences/unparsed_streambuf.cpp	2013-08-13 06:08:34 +0000
@@ -0,0 +1,226 @@
++#include "unparsed_streambuf.h"
++
++#include "diagnostics/xquery_diagnostics.h"
++#include "diagnostics/util_macros.h"
++#include "util/oseparator.h"
++
++#include <iomanip>
++
++using namespace std;
++
++namespace zorba {
++namespace unparsed{
++  namespace xml{
++    streambuf::pos_type streambuf::seekoff( off_type o, ios_base::seekdir d, ios_base::openmode m )
++    {
++      clear();
++      return original()->pubseekoff( o, d, m );
++    }
++
++    streambuf::pos_type streambuf::seekpos( pos_type p, ios_base::openmode m )
++    {
++      clear();
++      return original()->pubseekpos( p, m );
++    }
++
++    streambuf::int_type streambuf::pbackfail( int_type c )
++    {
++      if ( !traits_type::eq_int_type( c, traits_type::eof() ) &&
++           gbuf_.cur_len_ &&
++           original()->sputbackc( traits_type::to_char_type( c ) ) ) {
++        --gbuf_.cur_len_;
++        return c;
++      }
++      return traits_type::eof();
++    }
++
++    streambuf::int_type streambuf::uflow()
++    {
++    #ifdef ZORBA_DEBUG_UTF8_STREAMBUF
++      printf( "uflow()\n" );
++    #endif
++      int_type const c = original()->sbumpc();
++      if ( traits_type::eq_int_type( c, traits_type::eof() ) )
++        return traits_type::eof();
++      gbuf_.validate( traits_type::to_char_type( c ) );
++      return c;
++    }
++
++    inline void streambuf::clear() {
++      gbuf_.clear();
++    }
++
++    streamsize streambuf::xsgetn(char_type* to, std::streamsize size )
++    {
++    #ifdef ZORBA_DEBUG_UTF8_STREAMBUF
++      printf( "xsgetn()\n" );
++    #endif
++      streamsize return_size = 0;
++
++      if ( gbuf_.char_len_ ) {
++        streamsize const want = gbuf_.char_len_ - gbuf_.cur_len_;
++        streamsize const get = min( want, size );
++        streamsize const got = original()->sgetn( to, get );
++        for ( streamsize i = 0; i < got; ++i )
++          gbuf_.validate( to[i] );
++        to += got;
++        size -= got, return_size += got;
++      }
++
++      while ( size > 0 ) {
++        if ( streamsize const got = original()->sgetn( to, size ) ) {
++          for ( streamsize i = 0; i < got; ++i )
++            gbuf_.validate( to[i] );
++          to += got;
++          size -= got, return_size += got;
++        } else
++          break;
++      }
++      return return_size;
++    }
++
++    inline void streambuf::buf_type::clear()
++    {
++      char_len_ = 0;
++    }
++
++    void streambuf::buf_type::throw_invalid_utf8( utf8::storage_type *buf, utf8::size_type len ) {
++      ostringstream oss;
++      oss << hex << setfill('0') << setw(2) << uppercase;
++      oseparator comma( ',' );
++
++      for ( utf8::size_type i = 0; i < len; ++i )
++        oss << comma << "0x" << (static_cast<unsigned>( buf[i] ) & 0xFF);
++
++      clear();
++      throw ZORBA_EXCEPTION(
++        zerr::ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE,
++        ERROR_PARAMS( oss.str() )
++      );
++    }
++
++    void streambuf::buf_type::validate( utf8::storage_type c, bool bump ) {
++      utf8::size_type char_len_copy = char_len_, cur_len_copy = cur_len_;
++
++      if ( !char_len_copy ) {
++        //
++        // This means we're (hopefully) at the first byte of a UTF-8 byte sequence
++        // comprising a character.
++        //
++        try {
++          char_len_copy = utf8::char_length( c );
++          cur_len_copy = 0;
++          if (!c)
++            throw_invalid_utf8 ( &c, 1);
++        }
++        catch ( utf8::invalid_byte const& ) {
++          throw_invalid_utf8( &c, 1 );
++        }
++      }
++
++      utf8::storage_type *const cur_byte_ptr = utf8_char_ + cur_len_copy;
++      utf8::storage_type const old_byte = *cur_byte_ptr;
++      *cur_byte_ptr = c;
++
++      if ( cur_len_copy++ && !utf8::is_continuation_byte( c ) )
++        throw_invalid_utf8( utf8_char_, cur_len_copy );
++
++      if ( bump ) {
++        char_len_ = (cur_len_copy == char_len_copy ? 0 : char_len_copy);
++        cur_len_ = cur_len_copy;
++      } else {
++        *cur_byte_ptr = old_byte;
++      }
++    }
++  } //xml namespace
++
++  streambuf::streambuf(std::streambuf* orig, zstring const& uri, QueryLoc const& loc) :
++    proxy_buf(new xml::streambuf(orig)),
++    i_uri(uri),
++    i_loc(loc),
++    i_mark(0)
++  {
++  }
++
++  streambuf::streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc) :
++    proxy_buf(new xml::streambuf(orig)),
++    i_uri(uri),
++    i_loc(loc),
++    i_mark(mark)
++  {
++  }
++
++  streambuf::~streambuf() {}
++
++  void streambuf::imbue( std::locale const &loc)
++  {
++    proxy_buf->pubimbue( loc );
++  }
++
++  streambuf::pos_type streambuf::seekoff( off_type o, ios_base::seekdir d, ios_base::openmode m )
++  {
++    return proxy_buf->pubseekoff( o + i_mark, d, m);
++  }
++
++  streambuf::pos_type streambuf::seekpos( pos_type p, ios_base::openmode m ) {
++    return proxy_buf->pubseekpos( p, m );
++  }
++
++  std::streambuf* streambuf::setbuf( char_type *p, streamsize s ) {
++    proxy_buf->pubsetbuf( p, s );
++    return this;
++  }
++
++  streamsize streambuf::showmanyc() {
++    return proxy_buf->in_avail();
++  }
++
++  int streambuf::sync() {
++    return proxy_buf->pubsync();
++  }
++
++  streambuf::int_type streambuf::overflow( int_type c ) {
++    return proxy_buf->sputc( c );
++  }
++
++  streambuf::int_type streambuf::pbackfail( int_type c ) {
++    return  traits_type::eq_int_type( c, traits_type::eof() ) ?
++            c : proxy_buf->sputbackc( traits_type::to_char_type( c ) );
++  }
++
++  streambuf::int_type streambuf::uflow() {
++    return proxy_buf->sbumpc();
++  }
++
++  streambuf::int_type streambuf::underflow() {
++    return proxy_buf->sgetc();
++  }
++
++  streamsize streambuf::xsgetn( char_type *to, streamsize size ) {
++    streamsize res;
++    try
++    {
++      res = proxy_buf->sgetn( to, size );
++    }
++    catch (ZorbaException const& e)
++    {
++      if (e.diagnostic() == zerr::ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE)
++        throw XQUERY_EXCEPTION(err::FOUT1190, ERROR_PARAMS(i_uri.c_str()), ERROR_LOC(i_loc));
++      else throw;
++    }
++    return res;
++  }
++
++  streamsize streambuf::xsputn( char_type const *from, streamsize size ) {
++    return proxy_buf->sputn( from, size );
++    }
++
++  /*********************************************************************/
++  /*********************************************************************/
++  std::streambuf* alloc_streambuf(std::streambuf *orig, int mark, zstring const& uri, QueryLoc const& loc)
++  {
++    return new zorba::unparsed::streambuf(orig, mark, uri, loc);
++  }
++
++}//namesapce unparsed
++}//namesapce zorba
 \ No newline at end of file
 === added file 'src/runtime/sequences/unparsed_streambuf.h'
 --- src/runtime/sequences/unparsed_streambuf.h	1970-01-01 00:00:00 +0000
 +++ src/runtime/sequences/unparsed_streambuf.h	2013-08-13 06:08:34 +0000
@@ -0,0 +1,131 @@
++#ifndef ZORBA_UNPARSED_STREAM_H
++#define ZORBA_UNPARSED_STREAM_H
++
++#include <zorba/internal/streambuf.h>
++#include <zorba/internal/unique_ptr.h>
++#include "common/shared_types.h"
++#include "diagnostics/xquery_diagnostics.h"
++#include "diagnostics/util_macros.h"
++#include "util/utf8_streambuf.h"
++
++
++namespace zorba{
++namespace unparsed{
++  namespace xml{
++    //Streambuf class for validating valid characters on xml 1.0 and xml 1.1
++    class streambuf : public utf8::streambuf
++    {
++    public:
++      streambuf(std::streambuf* orig) :
++        utf8::streambuf( orig, false ){ clear(); }
++
++    private:
++      struct buf_type
++      {
++
++        utf8::encoded_char_type utf8_char_;
++        utf8::size_type char_len_;
++        utf8::size_type cur_len_;
++
++        void clear();
++        void throw_invalid_utf8( utf8::storage_type *buf, utf8::size_type len );
++        void validate( utf8::storage_type, bool bump = true );
++      };
++
++      buf_type gbuf_;
++
++    protected:
++      void clear();
++      std::streamsize xsgetn(char_type*, std::streamsize);
++      pos_type seekoff( off_type, std::ios_base::seekdir, std::ios_base::openmode );
++      pos_type seekpos( pos_type, std::ios_base::openmode );
++      int_type pbackfail( int_type );
++      int_type uflow();
++    };
++
++  }//namespace xml
++
++  class streambuf :public std::streambuf {
++  public:
++    streambuf(std::streambuf* orig, zstring const& uri, QueryLoc const& loc);
++    streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc);
++    ~streambuf();
++
++    std::streambuf* orig_streambuf() const {
++      return proxy_buf->original();
++    }
++
++  protected:
++    void imbue( std::locale const& );
++    pos_type seekoff( off_type, std::ios_base::seekdir, std::ios_base::openmode );
++    pos_type seekpos( pos_type, std::ios_base::openmode );
++    std::streambuf* setbuf( char_type*, std::streamsize );
++    std::streamsize showmanyc();
++    int sync();
++    int_type overflow( int_type );
++    int_type pbackfail( int_type );
++    int_type uflow();
++    int_type underflow();
++    std::streamsize xsgetn( char_type*, std::streamsize );
++    std::streamsize xsputn( char_type const*, std::streamsize );
++
++  private:
++    std::unique_ptr<internal::proxy_streambuf> proxy_buf;
++    zstring i_uri;
++    QueryLoc i_loc;
++    int i_mark;
++
++    streambuf( streambuf const&);
++    streambuf& operator=( streambuf const&);
++  };
++
++
++  std::streambuf* alloc_streambuf(std::streambuf* orig, int mark, zstring const& uri, QueryLoc const& loc);
++
++  template<typename charT, class Traits> inline
++    void attach( std::basic_ios<charT,Traits> &ios, int mark, zstring const& uri, QueryLoc const& loc)
++  {
++    int const index = std::ios_base::xalloc();
++    void *&pword = ios.pword( index );
++    if ( !pword ) {
++      std::streambuf *const buf =
++        alloc_streambuf( ios.rdbuf(), mark, uri, loc );
++      ios.rdbuf( buf );
++      pword = buf;
++      ios.register_callback( internal::stream_callback, index );
++    }
++  }
++
++  template<typename charT, class Traits> inline
++    void detach( std::basic_ios<charT,Traits> &ios )
++  {
++    int const index = std::ios_base::xalloc();
++    if ( streambuf* const buf = static_cast<streambuf*>( ios.pword( index ) ) )
++    {
++      ios.pword( index ) = 0;
++      ios.rdbuf( buf->orig_streambuf() );
++      delete buf;
++    }
++  }
++
++  template<class StreamType>
++    class auto_attach {
++    public:
++      auto_attach(StreamType &stream,  int mark, zstring const& uri, QueryLoc const& loc) : stream_( stream )
++      {
++        attach(stream, mark, uri, loc);
++      }
++
++      ~auto_attach()
++      {
++        detach( stream_ );
++      }
++
++    private:
++      StreamType &stream_;
++    };
++
++}//namespace unparsed
++}//namesapce zorba
++
++#endif