Beautiful Soup

Speed up find* methods by 2x in common cases by checking the tag name first

Bug #1898212 reported by Morotti on 2020-10-02

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	Fix Released	Undecided	Unassigned

Bug Description

The attached patch makes beautifulsoup twice as fast.
See function search_tag() in element.py. Edit the 3 lines.

Tested on beautifulsoup 4.9.1, lxml 4.5.2, python 3.7, windows 10.

How does this work?

find(something) iterates on the children elements, calling search_tag() to verify whether an element is a match for the filter, calling _matches() to perform the node comparison. It's basic AST tree parsing, nothing fancy.

This is horribly slow in python because it's running millions of lines of code, effectively a loop in a loop:
- Iterating through all the HTML/XML elements. Consider a bazillion elements in a large page.
- Comparing each element to the filter. Node comparison is a rather heavy recursive operation (see _matches).

If you think about how HTML is used and processed, a find("body") is going through a bazillion elements that are not a "body" hoping to find the one "body". Unfortunately the name comparison "body"==element.name is at the very end of the recursive node comparison, there's a tremendous waste of resources to get all the way down there only to notice the node doesn't match.

The patch adds an early comparison on the tag name. An element can't be a match if the name doesn't match, check for this early and abort never entering the heavy recursive comparison. (note: doesn't apply if there is a prefix on the node because prefixes affect names)

Result:

This improves search by tags, including find("tag") find_all("tag").
I suppose that's 90% of what people do with beautifulsoup so that's a massive speed improvement for the entire planet.

Let me know if you can see the patch. Note sure the attachment is working.

I have a code snippet for repro, text output showing software can run in 53% of the original time over various scenarios and screenshots of python profilers showing line by line information with bottlenecks. But I can't figure out how to attach any of that, doesn't seem to be support for attachments.

Really wish the project was on GitHub instead of this strange thing. ;)

Can probably do a similar thing with attributes.
If a search involves attributes and an element has no attributes, same thing, it's not going to match, could exit early.

Revision history for this message

Morotti (rmorotti) wrote on 2020-10-02:

patch for element.py Edit (45.4 KiB, text/plain)

Revision history for this message

Morotti (rmorotti) wrote on 2020-10-02:

perf_default.txt Edit (2.1 KiB, text/plain)

benchmark results, not patched.

Revision history for this message

Morotti (rmorotti) wrote on 2020-10-02:

perf_patch.txt Edit (2.1 KiB, text/plain)

benchmark result, patched

Revision history for this message

Morotti (rmorotti) wrote on 2020-10-02:

perf.py Edit (1.7 KiB, text/x-python)

repro code

Revision history for this message

Morotti (rmorotti) wrote on 2020-10-02:

profiler.png Edit (105.3 KiB, image/png)

profiler output with line by line sampling in prod.

On the top left, you can see the call stack.
find => find_all()[0] => _find_all => search => search_tag => _matches()

On the right, you can see source code of search_tag with line by line sampling. Most of the time is spent in this function, about half in subcalls to self._matches() that performs the node comparison.

P.S. The UI shows microsecond but this is not microsecond. Not sure what's the unit.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2020-10-02:

Thank you for doing this work. This is a significant optimization, explained well. Up to this point, all the performance work for Beautiful Soup has focused on speeding up the tree builder. I shouldn't be surprised that there was a simple search optimization yet to be made.

I've implemented a version of your patch as revision 591. I think this performance improvement is big enough to justify a point release on its own. Let me know if you have more work like this coming soon; otherwise I'll put out a release tomorrow.

I suspect the attribute optimization you mention will work, but not provide a very big benefit given the additional complexity. I'm interested in seeing what you have, though.

Leonard Richardson (leonardr) on 2020-10-03

Changed in beautifulsoup:
status:	New → Fix Committed
summary:	- patch: make beautifulsoup twice as fast + Speed up find* methods by 2x in common cases by checking the tag name + first