xml:parse() - infinite loop

Bug #1027270 reported by mb21
This bug affects 1 person
Affects: Zorba
Status: Fix Released
Importance: Critical
Assigned to: Nicolae Brinza

Bug Description

"xmllint --encode ascii wiki.xml" reveals that for some reason the input file contains lots of uncommon UTF-8 characters (when viewed with cat and vim, the font seems to substitute the closest ASCII character for each of them).

Strangely it doesn't seem to be only one character but a combination of lines that provokes the behaviour (I tried removing some lines individually but couldn't reproduce after that).

mb21 (mauro-bieg) wrote :

the input file

Changed in zorba:
assignee: nobody → Nicolae Brinza (nbrinza)
importance: Undecided → Medium
Chris Hillery (ceejatec) wrote :

I did some brief debugging: the method FragmentXmlLoader::loadXml() goes into an infinite loop with this input document. Specifically, when it starts parsing the <template head="R..." element, it repeatedly gets to line 332 in that file:

      if (theXQueryDiagnostics->errors().empty()
          &&
          theFragmentStream->current_offset == 0)
      {
        if (theFragmentStream->state == FragmentIStream::FRAGMENT_FIRST_START_DOC)
          FragmentXmlLoader::startDocument(theFragmentStream->ctxt->userData);
        xmlParseCharData(theFragmentStream->ctxt, 0);
        theFragmentStream->current_offset = getCurrentInputOffset(); // update current offset

And theFragmentStream->current_offset is set (again) to 0 at this point, meaning it will get to the same point the next time through, and so on.
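The stall described above can be shown in isolation. This is only a minimal sketch, not Zorba's code and not the fix that was eventually released: the `Stream` struct and `run_with_guard` are hypothetical names, and the guard (give up as soon as a step consumes no input) is just one illustrative way to turn such a spin into a reportable error.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for the fragment stream: only the offset matters here.
struct Stream {
  std::size_t current_offset = 0;
  std::size_t length = 0;
};

// Runs `step` until the stream is consumed. `step` returns the new offset.
// Returns true on normal completion and false if a step made no forward
// progress (the situation in this report, where the offset is set to 0 again).
bool run_with_guard(Stream& s,
                    const std::function<std::size_t(const Stream&)>& step) {
  while (s.current_offset < s.length) {
    const std::size_t next = step(s);
    if (next <= s.current_offset) {
      return false;  // no progress: bail out instead of looping forever
    }
    s.current_offset = next;
  }
  return true;
}
```

A step that always returns 0 makes `run_with_guard` return false on the first iteration, whereas a step that advances the offset runs to completion.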

Changed in zorba:
status: New → Confirmed
importance: Medium → Critical
milestone: none → 2.7
Chris Hillery (ceejatec) wrote :

Mauro - FYI, this wiki.xml file has no character references in it at all; did you put the right input file on this bug? In any case, it certainly does exhibit a bug.

summary: - parse-xml - endless 100%CPU with lots of character references
+ parse-xml - infinite loop
summary: - parse-xml - infinite loop
+ xml:parse() - infinite loop
mb21 (mauro-bieg) wrote :

@Chris, you are right: since I added the <?xml version="1.0" encoding="UTF-8" ?> declaration, xmllint also prints it properly. But with "xmllint --encode ascii wiki.xml" you get the behaviour I described; strange default.

Anyway, so all characters are valid UTF-8. But what I found is that most characters in that document aren't the ones they appear to be. For example, most y's aren't actually the ordinary y (&#121;) but rather the "Latin Capital Letter Y with hook" (&#435;). Similarly, some i's aren't actually the ordinary i (&#105;), but the "Cyrillic Small Letter Byelorussian-Ukrainian I" (&#1110;). Hope that helps.
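For anyone wanting to check a file for such look-alike characters without xmllint, here is a small sketch that lists every code point above U+007F in a UTF-8 string. The function name `non_ascii_code_points` is made up for this example, and the decoder assumes well-formed input (it does not diagnose overlong or otherwise invalid sequences).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 and returns every code point above U+007F, in order.
// Sketch only: malformed input is not diagnosed; a stray continuation
// byte is skipped and a truncated trailing sequence ends the scan.
std::vector<std::uint32_t> non_ascii_code_points(const std::string& utf8) {
  std::vector<std::uint32_t> found;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    std::uint32_t cp = 0;
    if (lead < 0x80) { ++i; continue; }  // plain ASCII, not reported
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else { ++i; continue; }              // stray continuation byte
    if (i + extra >= utf8.size()) break; // truncated sequence at the end
    for (std::size_t k = 1; k <= extra; ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
    }
    found.push_back(cp);
    i += extra + 1;
  }
  return found;
}
```

Fed the bytes `C6 B3` and `D1 96`, this reports the decimal values 435 and 1110, matching the two character references quoted above.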

mb21 (mauro-bieg)
description: updated
description: updated
Changed in zorba:
status: Confirmed → Fix Committed
Changed in zorba:
status: Fix Committed → Fix Released