Provide self and parent-or-self search functionality
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Committed | Wishlist | Unassigned |
Bug Description
This is an enhancement request.
Coming from XML/XSLT and XML::Twig in Perl, I have two enhancement requests for element-matching functionality in Beautiful Soup.
1. Implement a matches() method that tests the element itself using the standard name/attribute/
====
if mynode.
# ...
====
The name/attribute/
If the element matches, it should return itself; otherwise it should return None.
This is analogous to a direct node filtering predicate in XSLT:
====
$mynode[...filter-y stuff...]
====
2. Implement variants of find_parent() and find_parents() that *also* consider the element itself. For example,
====
mynode.
mynode.
# or
mynode.
mynode.
====
This is analogous to the ancestor-or-self axis in XSLT:
====
$mynode/
====
Again, this can be implemented explicitly in code, but it would be cleaner and clearer to use UI-consistent methods in content processing algorithms.
First, let's take inventory of what we have in the 4.13 branch that's in the vicinity of this wishlist item:
ElementFilter.match(PageElement) -> bool
match(Iterator[PageElement], ElementFilter) -> ResultSet
PageElement.
I'm OK with adding an Iterator that is the equivalent of ancestor-or-self. The name would probably be PageElement.self_and_parents.
    @property
    def self_and_parents(self) -> Iterator[PageElement]:
        yield self
        for i in self.parents:
            yield i
I don't want to add more find_* methods, but since we already have find_parents, I'm OK with adding an include_self argument to it.
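To make the two ideas above concrete, here is a minimal, self-contained sketch. It does not use Beautiful Soup: `Node` is a hypothetical stand-in for a tree element, and the simplified `find_parents(name, include_self=...)` signature is an assumption for illustration, not the library's actual signature.

```python
from __future__ import annotations

from typing import Iterator, Optional


class Node:
    """Hypothetical stand-in for a parse-tree element (not bs4.PageElement)."""

    def __init__(self, name: str, parent: Optional[Node] = None) -> None:
        self.name = name
        self.parent = parent

    @property
    def parents(self) -> Iterator[Node]:
        # Walk upward, starting at the immediate parent.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    @property
    def self_and_parents(self) -> Iterator[Node]:
        # Ancestor-or-self: the element itself first, then its ancestors.
        yield self
        yield from self.parents

    def find_parents(self, name: str, include_self: bool = False) -> list[Node]:
        # Choose the axis once, then apply the same name test to each candidate.
        axis = self.self_and_parents if include_self else self.parents
        return [node for node in axis if node.name == name]


root = Node("html")
section = Node("div", parent=root)
inner = Node("div", parent=section)

without_self = inner.find_parents("div")
with_self = inner.find_parents("div", include_self=True)
```

With this sketch, `without_self` contains only `section`, while `with_self` contains `inner` followed by `section`, which mirrors the ancestor-or-self behaviour requested in the description.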
A PageElement.matches() method that takes the find_* arguments and returns a boolean is a little troubling for two reasons. First, it's another method with a very complicated signature. But more importantly, you're probably not calling PageElement.matches() just once. You're probably iterating over elements somehow and calling matches() each time. If you do that you're creating an identical ElementFilter for every element in the iterator, which is incredibly inefficient. It'd be much better to build the ElementFilter once. And there is a solution that works that way: either pass the ElementFilter into one of the find_* methods, or use PageElement.match(). So I'm not really sold on PageElement.matches().
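The efficiency concern can be illustrated with a small sketch. Everything here is hypothetical: `SimpleFilter` and `Element` are stand-ins written for this example, not Beautiful Soup's ElementFilter or PageElement, and the `built` counter exists only to make the number of filter constructions visible.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Element:
    """Hypothetical stand-in for a tag with a name and attributes."""
    name: str
    attrs: dict = field(default_factory=dict)


class SimpleFilter:
    """Hypothetical reusable filter (not bs4's ElementFilter)."""

    built = 0  # counts how many filter objects have been constructed

    def __init__(self, name: str, **attrs: str) -> None:
        SimpleFilter.built += 1
        self.name = name
        self.attrs = attrs

    def match(self, element: Element) -> bool:
        # True if the name and every requested attribute value agree.
        return element.name == self.name and all(
            element.attrs.get(key) == value for key, value in self.attrs.items()
        )

    def filter(self, elements: Iterable[Element]) -> Iterator[Element]:
        # Apply this one filter object to a whole stream of elements.
        return (element for element in elements if self.match(element))


def matches(element: Element, name: str, **attrs: str) -> bool:
    # Per-call convenience: builds a fresh, identical filter on every call.
    return SimpleFilter(name, **attrs).match(element)


elements = [
    Element("a", {"rel": "nofollow"}),
    Element("p"),
    Element("a", {"rel": "author"}),
    Element("a", {"rel": "nofollow"}),
]

# Approach 1: call matches() per element, one filter built per element.
per_call_hits = [e for e in elements if matches(e, "a", rel="nofollow")]
built_per_call = SimpleFilter.built

# Approach 2: build the filter once and reuse it for the whole iteration.
shared = SimpleFilter("a", rel="nofollow")
reused_hits = list(shared.filter(elements))
built_reused = SimpleFilter.built - built_per_call
```

Both approaches select the same elements, but the first constructs one filter per element while the second constructs exactly one, which is the trade-off described above.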
However, we probably want to rename PageElement.match() before releasing 4.13, since the name makes it look like it returns a boolean, like ElementFilter.match() does. Maybe it should be called PageElement.filter().