Comment 53 for bug 1100282

Christian Heimes (heimes) wrote : Re: DoS through XML entity expansion

The lxml documentation still suggests a parser pool for threaded applications:

http://lxml.de/element_classes.html
To avoid interfering with other modules, however, it is usually a better idea to use a dedicated parser for each module (or a parser pool when using threads) and then register the required lookup scheme only for this parser.

Here is some example code from work. We use a custom Element class and thread-local storage for the parser instances, so every thread gets its own parser.

import threading
from lxml import etree

class RestrictedElement(etree.ElementBase):
    __slots__ = ()
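    # blacklist = (etree._Entity, etree._ProcessingInstruction, etree._Comment)
    # Filter entity reference nodes. etree._Element would match every
    # element, including RestrictedElement itself, and hide all children.
    blacklist = etree._Entity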

    def __iter__(self):
        blacklist = self.blacklist
        for child in super(RestrictedElement, self).__iter__():
            if isinstance(child, blacklist):
                continue
            yield child

    def iterchildren(self, tag=None, reversed=False):
        blacklist = self.blacklist
        children = super(RestrictedElement, self).iterchildren(tag=tag,
                                                               reversed=reversed)
        for child in children:
            if isinstance(child, blacklist):
                continue
            yield child

    # you may need to override getchildren, find, findall and more if you use them

class ParserTLS(threading.local):
    parser_cfg = {
        'resolve_entities': False,
        'remove_comments': True,
        'remove_pis': True,
    }

    @property
    def parser(self):
        parser = getattr(self, "_parser", None)
        if parser is None:
            parser = etree.XMLParser(**self.parser_cfg)
            lookup = etree.ElementDefaultClassLookup(element=RestrictedElement)
            parser.set_element_class_lookup(lookup)
            self._parser = parser
        return parser

if __name__ == "__main__":
    tls = ParserTLS()
    tree = etree.parse("test.xml", parser=tls.parser)
    # print() with a single argument works on both Python 2 and Python 3
    print(tree.getroot().text)
    print(list(tree.getroot().iterchildren()))
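
For anyone wondering why the filtering is needed at all: with resolve_entities=False the parser does not expand entity references, but as far as I can tell it keeps them in the tree as etree._Entity nodes that show up when you iterate over an element. A minimal sketch, assuming a made-up sample document with a single internal entity (the XML string and the variable names are only for illustration):

```python
from lxml import etree

# Hypothetical sample document: one internal entity, referenced once.
XML = (b'<!DOCTYPE root [<!ENTITY a "expanded">]>'
       b'<root>before &a; after<child/></root>')

# Same setting as in parser_cfg above: do not expand entity references.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(XML, parser)

# Iterating over an element yields entity reference nodes next to elements.
nodes = list(root)
entities = [n for n in nodes if isinstance(n, etree._Entity)]
elements = [n for n in nodes if not isinstance(n, etree._Entity)]

print([e.name for e in entities])
print([e.tag for e in elements])
```

A plain element class hands those _Entity nodes to application code, which is what the isinstance() check in RestrictedElement is meant to prevent.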

@Thierry:
I'll ask on the PSRT list. My patch for expat won't be ready until Wednesday, but we can release the restricted expat parser classes for etree, sax and minidom as hotfixes. I'm waiting for some code review now. I also need to get back to the libxml2 guys ASAP.