Incorrect JSON serialization of supplementary plane code points

Bug #1025622 reported by Dennis Knochenwefel
This bug affects 1 person
Affects: Zorba
Status: Fix Released
Importance: Undecided
Assigned to: Paul J. Lucas
Milestone: (none)

Bug Description

This bug is a follow-up to bug #1024448.

Currently, the result of the following JSONiq query:

  let $message := "👊"
  return { "message": $message }

is serialized into incorrect JSON:

  { "message" : "\ufffffff0\uffffff9f\uffffff91\uffffff8a" }

The correct result would be:

  { "message" : "\ud83d\udc4a" }

Explanation:

Characters from the supplementary planes (code points above U+FFFF) are usually represented as UTF-16 surrogate pairs in JSON results. The above result is incorrect in particular because JSON allows exactly 4 hex digits after '\u'. In UTF-16, every code point fits into either one 4-hex-digit unit or two 4-hex-digit units (a surrogate pair), which is most probably the reason why UTF-16 is used.
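As a worked illustration of the surrogate-pair arithmetic, here is a minimal C++ sketch. The function name json_escape_code_point is hypothetical and is not part of Zorba; it only shows how one code point maps to one or two '\u' escapes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not Zorba code): format one Unicode code point as
// JSON \uXXXX escapes, using a UTF-16 surrogate pair above U+FFFF.
std::string json_escape_code_point(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: within the Unicode range
    char buf[13];            // "\uXXXX\uXXXX" is 12 characters plus NUL
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 4-hex-digit escape.
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
    } else {
        // Supplementary plane: subtract 0x10000, then split the remaining
        // 20 bits into a high (top 10 bits) and low (bottom 10 bits) surrogate.
        const std::uint32_t v = cp - 0x10000;
        const unsigned high = static_cast<unsigned>(0xD800 + (v >> 10));
        const unsigned low  = static_cast<unsigned>(0xDC00 + (v & 0x3FF));
        std::snprintf(buf, sizeof buf, "\\u%04x\\u%04x", high, low);
    }
    return buf;
}
```

For the character in the query above, U+1F44A, the subtraction gives 0xF44A; the top 10 bits are 0x3D and the bottom 10 bits are 0x4A, so the pair is \ud83d\udc4a, matching the expected result.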

This has already been fixed in the JSON parser by Paul (see merge proposal: https://code.launchpad.net/~paul-lucas/zorba/bug-1024448/+merge/115248 ), but it still needs to be fixed in the serializer.

@Paul: I'm not sure if you are the right person to assign this bug to?

thanks

Paul J. Lucas (paul-lucas) wrote :

First, how does one execute a JSONiq query? If I put the above query into a file and do:

  bin/zorba -f -i -r --trailing-nl -q /tmp/foo.xq

I get:

  </tmp/foo.xq>:2,8: static error [err:XPST0003]: invalid expression; raised at .../src/compiler/translator/translator.cpp:11081

Changed in zorba:
status: New → Incomplete
Dennis Knochenwefel (dennis-knochenwefel) wrote :

I think that building Zorba with the option -DZORBA_WITH_JSON=ON is sufficient.

Paul J. Lucas (paul-lucas) wrote :

I put some breakpoints in and it never hits my serialization code, so it's probably in the JSONiq serialization code.

Chris Hillery (ceejatec) wrote :

The problem is almost certainly in void serializer::json_emitter::emit_json_string(zstring string), serializer.cpp line 1206 or thereabouts, where it escapes invalid characters into Unicode escape sequences. I have no idea how to do that any differently than it does now, so Paul, please take a look and see if there are obvious logic problems.

Chris Hillery (ceejatec)
Changed in zorba:
status: Incomplete → Confirmed
Paul J. Lucas (paul-lucas) wrote :

The problem with that code is that it serializes the string as a sequence of bytes (which is wrong) and not a sequence of either Unicode code-points or UTF-8 characters.
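To illustrate the difference between byte-wise and code-point-wise escaping, here is a minimal C++ sketch. It is not the actual fix and not Zorba's emit_json_string(); the names escape_json_utf8 and append_u_escape are hypothetical, std::string stands in for zstring, and escaping every non-ASCII character is an assumption chosen to match the expected output in the report. The "ffffff" prefixes in the reported output are consistent with bytes >= 0x80 being sign-extended before formatting, though nothing in this report confirms that, so the sketch reads bytes as unsigned char.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: append one code point as \uXXXX, using a UTF-16
// surrogate pair for code points above U+FFFF.
static void append_u_escape(std::string& out, std::uint32_t cp) {
    char buf[13];  // "\uXXXX\uXXXX" is 12 characters plus NUL
    if (cp < 0x10000) {
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;
        std::snprintf(buf, sizeof buf, "\\u%04x\\u%04x",
                      static_cast<unsigned>(0xD800 + (v >> 10)),
                      static_cast<unsigned>(0xDC00 + (v & 0x3FF)));
    }
    out += buf;
}

// Hypothetical sketch: escape a UTF-8 string for JSON by decoding it into
// code points first. Malformed input becomes U+FFFD; overlong forms and
// out-of-range values are not checked, to keep the sketch short.
std::string escape_json_utf8(const std::string& s) {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        // Read as unsigned char so bytes >= 0x80 are never sign-extended.
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = b;
        std::size_t len = 1;
        if ((b & 0xE0) == 0xC0)      { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else if (b >= 0x80)          { cp = 0xFFFD; }  // stray or invalid lead byte

        // Fold in the continuation bytes (each must look like 10xxxxxx).
        bool ok = (i + len <= s.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { cp = 0xFFFD; len = 1; }  // malformed: emit U+FFFD, skip one byte

        if (cp == '"' || cp == '\\') {
            out += '\\';
            out += static_cast<char>(cp);
        } else if (cp < 0x20 || cp >= 0x80) {
            append_u_escape(out, cp);  // control and non-ASCII characters
        } else {
            out += static_cast<char>(cp);  // printable ASCII passes through
        }
        i += len;
    }
    return out;
}
```

With the message from the original query, whose UTF-8 bytes are F0 9F 91 8A, the four bytes decode to the single code point U+1F44A, which is then written as the surrogate pair \ud83d\udc4a rather than as four separate byte escapes.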

I'll fix it myself.

Changed in zorba:
status: Confirmed → In Progress
summary: - incorrect JSON serialization of supplementory plane code points
+ Incorrect JSON serialization of supplementory plane code points
summary: - Incorrect JSON serialization of supplementory plane code points
+ Incorrect JSON serialization of supplementary plane code points
Changed in zorba:
status: In Progress → Fix Committed
Changed in zorba:
status: Fix Committed → Fix Released