Zorba

xml:parse() - infinite loop

Bug #1027270 reported by mb21 on 2012-07-20

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Zorba	Fix Released	Critical	Nicolae Brinza	Zorba 2.7 "Gaia"

Bug Description

"xmllint --encode ascii wiki.xml" reveals that for some reason the input file contains lots of uncommon UTF-8 characters (when viewed with cat and vim the font seems to substitutes those with the closest ASCII char).

Strangely it doesn't seem to be only one character but a combination of lines that provokes the behaviour (I tried removing some lines individually but couldn't reproduce after that).

Related branches

lp:~nbrinza/zorba/parse-fragment

Merged into lp:zorba at revision 11208

Chris Hillery: Approve on 2013-01-30

Nicolae Brinza: Approve on 2013-01-20

mb21 (mauro-bieg) wrote on 2012-07-20:

#1

reproduce (659 bytes, text/plain)

mb21 (mauro-bieg) wrote on 2012-07-20:

#2

wiki.xml (4.7 KiB, text/xml)

the input file

Matthias Brantner (matthias-brantner) on 2012-07-20

Changed in zorba:
assignee:	nobody → Nicolae Brinza (nbrinza)
importance:	Undecided → Medium

Chris Hillery (ceejatec) wrote on 2012-07-20:

#3

I did some brief debugging: the method FragmentXmlLoader::loadXml() goes into an infinite loop with this input document. Specifically, when it starts parsing the <template head="R..." element, it repeatedly gets to line 332 in that file:

      if (theXQueryDiagnostics->errors().empty()
          &&
          theFragmentStream->current_offset == 0)
      {
        if (theFragmentStream->state == FragmentIStream::FRAGMENT_FIRST_START_DOC)
          FragmentXmlLoader::startDocument(theFragmentStream->ctxt->userData);
        xmlParseCharData(theFragmentStream->ctxt, 0);
        theFragmentStream->current_offset = getCurrentInputOffset(); // update current offset

And theFragmentStream->current_offset is set (again) to 0 at this point, meaning it will get to the same point the next time through, and so on.
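The failure mode described above (a loop whose offset never advances) can be illustrated with a small sketch. This is not Zorba's code and not the fix that was committed; `drive_parser` and `parse_step` are hypothetical names, with `parse_step` standing in for one pass through the loop body. The sketch only shows how a progress check turns a silent infinite loop into a reported error.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical helper, not part of Zorba or libxml2.
// `parse_step` stands in for one pass through the loop body: it receives the
// current input offset and returns the offset reached after that pass.
// `total` is the input length. Returns the number of passes performed and
// throws if a pass fails to advance the offset.
inline std::size_t drive_parser(
    const std::function<std::size_t(std::size_t)>& parse_step,
    std::size_t total)
{
  std::size_t offset = 0;
  std::size_t steps = 0;
  while (offset < total) {
    const std::size_t next = parse_step(offset);
    // If the offset did not move forward, the next pass would see exactly
    // the same state and repeat forever, so stop here instead.
    if (next <= offset)
      throw std::runtime_error("parser made no progress");
    offset = next;
    ++steps;
  }
  return steps;
}
```

With a step that always returns 0, as current_offset does in the trace above, this throws on the first pass instead of spinning at 100% CPU.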

Changed in zorba:
status:	New → Confirmed
importance:	Medium → Critical
milestone:	none → 2.7

Chris Hillery (ceejatec) wrote on 2012-07-20:

#4

Mauro - FYI, this wiki.xml file has no character references in it at all; did you put the right input file on this bug? In any case, it certainly does exhibit a bug.

summary:	- parse-xml - endless 100%CPU with lots of character references
	+ parse-xml - infinite loop
summary:	- parse-xml - infinite loop
	+ xml:parse() - infinite loop

mb21 (mauro-bieg) wrote on 2012-07-20:

#5

@Chris, you are right: since I added the <?xml version="1.0" encoding="UTF-8" ?> declaration, xmllint also prints it properly. But with "xmllint --encode ascii wiki.xml" you get the behaviour I described; strange default.

Anyway, so all characters are valid UTF-8. But what I found is that most characters in that document aren't what they appear to be. For example, most y's aren't actually the ordinary Y (y) but rather the "Latin Capital Letter Y with hook" (Ƴ). Similarly, some i's aren't actually the ordinary I (i), but the "Cyrillic Small Letter Byelorussian-Ukrainian I" (і). Hope that helps.
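One way to find such look-alike characters without relying on how a font renders them is to decode the UTF-8 bytes and list every code point above U+007F. The following is a minimal sketch, assuming well-formed UTF-8 input; `non_ascii_code_points` is a hypothetical helper, not part of Zorba or xmllint, and it does not diagnose malformed sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: returns the code points above U+007F found in `utf8`,
// in order of appearance. Assumes well-formed UTF-8; malformed input is not
// reported as an error.
inline std::vector<std::uint32_t> non_ascii_code_points(const std::string& utf8)
{
  std::vector<std::uint32_t> found;
  for (std::size_t i = 0; i < utf8.size();) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    std::size_t len = 1;
    std::uint32_t cp = lead;
    // The lead byte determines the sequence length and the payload bits.
    if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
    else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
    // Each continuation byte contributes its low six bits.
    for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
    if (cp > 0x7F)
      found.push_back(cp);
    i += len;
  }
  return found;
}
```

For the two characters named above, the bytes C6 B3 decode to U+01B3 (Ƴ) and D1 96 decode to U+0456 (і, decimal 1110).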

mb21 (mauro-bieg) on 2012-07-20

description:	updated

Zorba Build Bot (zorba-buildbot) on 2012-08-30

Changed in zorba:
status:	Confirmed → Fix Committed

Dana Florescu (dflorescu) on 2012-10-24

Changed in zorba:
status:	Fix Committed → Fix Released
