Provide self and parent-or-self search functionality

Bug #2052936 reported by Chris Papademetrious
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Committed
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

This is an enhancement request.

Coming from XML/XSLT and XML::Twig in Perl, I have two enhancement requests for element-matching functionality in Beautiful Soup.

1. Implement a matches() method that tests the element itself using the standard name/attribute/string filtering UI:

====
if mynode.matches(...filter-y stuff...):
    # ...
====

The name/attribute/string tests can each be hand-coded in various ways, but it would be preferable for self-matching to use the same filter interface as search-matching.

If the element matches, it should return itself; otherwise it should return None.

This is analogous to a direct node filtering predicate in XSLT:

====
$mynode[...filter-y stuff...]
====

2. Implement variants of find_parent() and find_parents() that *also* consider the element itself. For example,

====
mynode.find_parent_or_self(...filter-y stuff...)
mynode.find_parents_or_self(...filter-y stuff...)

# or

mynode.find_parent(...filter-y stuff..., include_self=True)
mynode.find_parents(...filter-y stuff..., include_self=True)
====

This is analogous to the ancestor-or-self axis in XSLT:

====
$mynode/ancestor-or-self::*[...filter-y stuff...]
====

Again, this can be implemented explicitly in code, but it would be cleaner and clearer to use UI-consistent methods in content processing algorithms.
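For illustration only, the semantics of both requests can be sketched in plain Python. To stay independent of any particular Beautiful Soup version, the sketch below uses a tiny stand-in Node class; Node, matches(), self_and_ancestors() and find_parent_or_self() are all hypothetical names, and the filter handles only an exact tag name and exact attribute values rather than the full find_*() filter UI.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed element; not a Beautiful Soup class."""
    name: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


def matches(node: Node, name: Optional[str] = None, **attrs: str) -> Optional[Node]:
    """Return the node itself if it passes the filter, otherwise None."""
    if name is not None and node.name != name:
        return None
    if any(node.attrs.get(key) != value for key, value in attrs.items()):
        return None
    return node


def self_and_ancestors(node: Node) -> Iterator[Node]:
    """Yield the node, then each ancestor up to the root (ancestor-or-self)."""
    current: Optional[Node] = node
    while current is not None:
        yield current
        current = current.parent


def find_parent_or_self(node: Node, name: Optional[str] = None, **attrs: str) -> Optional[Node]:
    """Return the nearest match among the node and its ancestors, or None."""
    for candidate in self_and_ancestors(node):
        if matches(candidate, name, **attrs) is not None:
            return candidate
    return None


# A three-level chain: body > div.prolog > p
body = Node("body")
div = Node("div", {"class": "prolog"}, parent=body)
p = Node("p", parent=div)
```

With this chain, matches(div, "div") hands back div itself, matches(div, "p") gives None, and find_parent_or_self(p, "p") stops at p without climbing to its parents.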

Leonard Richardson (leonardr) wrote:

First, let's take inventory of what we have in the 4.13 branch that's in the vicinity of this wishlist item:

ElementFilter.match(PageElement) -> bool
PageElement.match(Iterator[PageElement], ElementFilter) -> ResultSet

I'm OK with adding an Iterator that is the equivalent of ancestor-or-self. The name would probably be PageElement.self_and_parents.

    @property
    def self_and_parents(self) -> Iterator[PageElement]:
        yield self
        for i in self.parents:
            yield i

I don't want to add more find_* methods, but since we already have find_parents, I'm OK with adding an include_self argument to it.

A PageElement.matches() method that takes the find_* arguments and returns a boolean is a little troubling for two reasons. First, it's another method with a very complicated signature. But more importantly, you're probably not calling PageElement.matches() just once. You're probably iterating over elements somehow and calling matches() each time. If you do that you're creating an identical ElementFilter for every element in the iterator, which is incredibly inefficient. It'd be much better to build the ElementFilter once. And there is a solution that works that way: either pass the ElementFilter into one of the find_* methods, or use PageElement.match(). So I'm not really sold on PageElement.matches().

However, we probably want to rename PageElement.match() before releasing 4.13, since the name makes it look like it returns a boolean, like ElementFilter.match() does. Maybe it should be called PageElement.filter().
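The "build the filter once" point can be illustrated without touching the 4.13 API at all. This is a minimal sketch under stated assumptions: elements are plain dicts, and build_filter() is a hypothetical helper standing in for constructing an ElementFilter; it is not part of Beautiful Soup.

```python
import re
from typing import Callable, Dict, List

Element = Dict[str, str]  # hypothetical stand-in, e.g. {"name": "a", "href": "..."}


def build_filter(name: str, href_pattern: str) -> Callable[[Element], bool]:
    """Do the setup work once (here, compiling a regex) and return a reusable predicate."""
    href_re = re.compile(href_pattern)

    def predicate(element: Element) -> bool:
        # Both the tag name and the href pattern must match.
        return element.get("name") == name and bool(href_re.search(element.get("href", "")))

    return predicate


elements: List[Element] = [
    {"name": "a", "href": "https://example.com/docs"},
    {"name": "a", "href": "mailto:someone@example.com"},
    {"name": "div"},
]

# Build the filter a single time, outside the loop...
is_http_link = build_filter("a", r"^https?://")

# ...then apply it to every element without rebuilding it.
http_links = [element for element in elements if is_http_link(element)]
```

A per-element matches(...) call that takes the raw filter arguments would have to repeat the setup step inside the loop, which is the cost being described above.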

Changed in beautifulsoup:
importance: Undecided → Wishlist
Chris Papademetrious (chrispitude) wrote:

The PageElement.self_and_parents Iterator and an include_self argument for find_parents() would be great, if you're able to squeeze them in!

For PageElement.matches(), my primary use would be code where I make decisions about restructuring content based on what exists. This would involve more random control logic than looping. (I am careful to use loops where appropriate, as runtime is important for the amount of content I process.)

I'm really just looking for a variant of find_*() that works on that element itself. It can even return the element itself on a positive match. For example, I might do something like this:

====
if (this_tag.matches('div', class_='prolog')
        and this_tag.next_sibling
        and this_tag.next_sibling.matches('div', class_='abstract')):
    # do some stuff
    pass
====

Actually, XML::Twig has variants of matches() like next_sibling_matches() which elegantly handles the case where the next thing doesn't exist at all, but I have a feeling I'm wearing out my welcome by mentioning it. :) Maybe I should consider extending bs4 and adding all these convenience methods that I tend to use.

Leonard Richardson (leonardr) wrote:

What about a single iterator PageElement.self_and(other_iterator)?

e.g.

tag.self_and(tag.next_siblings)
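As a rough sketch of what such an iterator would do, here is a free function with the same shape. It is written as a generic helper over any iterable, so it is an assumption about the behaviour, not the method as it would appear on PageElement.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def self_and(element: T, others: Iterable[T]) -> Iterator[T]:
    """Yield the element first, then everything from the other iterator."""
    yield element
    yield from others


# Plain strings stand in for an element and its following siblings.
following = ["second", "third"]
result = list(self_and("first", following))
```

Under that reading, tag.self_and(tag.next_siblings) would correspond to self_and(tag, tag.next_siblings), and passing tag.parents instead would give the ancestor-or-self sequence.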

Changed in beautifulsoup:
status: New → Fix Committed
Chris Papademetrious (chrispitude) wrote:

For the self-and-* matching functionality, for the following command:

====
tag.find_parent(True, {'data-base-uri': True}, include_self=True)
====

what would the tag.self_and() version of this command be?

For the self-only matching functionality, I really hope you consider implementing the matches() method. It would be extremely useful for our processing code. I understand your concerns about runtime, but runtime is not a factor for our application, whereas code clarity and maintainability are.

For example, we have code that processes HTML hierarchically like this:

====
for tag in reversed(soup.find_all(True)):
  if tag.matches(True, href=re.compile(...)):
    # do some stuff
  elif tag.matches(['div', 'body'], class_=['abstract', 'summary']):
    # do some stuff
  elif tag.matches(...):
    # ...
  elif ...
====

and the clarity and ease of modifying the filtering tests is paramount. The actual implementation of this code is much more complicated and harder to follow than the code above.

On a side note, the reason I iterate through the find_all() tags in reverse is that it allows me to modify the contents of a tag as much as I want without inadvertently breaking the "for" iteration (because I never traverse downward into the stuff I just modified). This code pattern works very well for hierarchical processing!
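The reason the reverse traversal is safe can be checked on a toy tree. This sketch assumes a hypothetical Node class with a children list (not a Beautiful Soup type) and a snapshot list taken in document order, which plays the role of the find_all() result.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a tag with child tags."""
    name: str
    children: List["Node"] = field(default_factory=list)


def document_order(node: Node) -> Iterator[Node]:
    """Yield the node, then its descendants, depth first (document order)."""
    yield node
    for child in node.children:
        yield from document_order(child)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node below the given node, excluding the node itself."""
    for child in node.children:
        yield from document_order(child)


tree = Node("html", [
    Node("body", [
        Node("div", [Node("p"), Node("span")]),
        Node("ul", [Node("li")]),
    ]),
])

# Snapshot taken up front, analogous to the list returned by find_all(True).
snapshot = list(document_order(tree))

# In reversed document order, every descendant comes before its ancestor.
position = {id(node): index for index, node in enumerate(reversed(snapshot))}
descendants_first = all(
    position[id(d)] < position[id(n)] for n in snapshot for d in descendants(n)
)

visit_order = []
for node in reversed(snapshot):
    visit_order.append(node.name)
    if node.name == "div":
        # Rewrite the contents freely: everything below was already visited.
        node.children = [Node("replaced")]
```

Because each node's descendants are visited before the node itself, replacing a node's contents never invalidates anything the loop still has to visit.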
