Handling attributes whose value is the empty string as HTML boolean attributes

Bug #1915424 reported by Isaac Muse
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Recently a new issue was opened in Soup Sieve. At first I asserted that attribute values should all be strings, but I noticed that BS does a lot to ensure that attribute values that are not strings still get resolved to strings without choking. So I ended up caving and made sure that Soup Sieve also normalizes attribute values that are not strings, `None` being of particular interest.

Soup Sieve will now handle arbitrary types in attribute values, mostly to keep things from crashing, which I assume is also why BS does it. I'm fine with handling things like `None`, even if I'd argue the user should never explicitly assign `None` to an attribute. My interest is more in how BS handles an attribute value of `None` vs. an empty string in HTML, and why it handles them differently.

This brings me to my question. In HTML, `foo=""` and `foo` are essentially treated the same. Using CSS selectors in any browser, `[foo=""]` will match both attributes with explicit empty strings (`foo=""`) and attributes with implied empty strings (`foo`).
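As a rough illustration of that equivalence, here is a minimal sketch using only the standard library's `html.parser`, which reports a bare attribute with the value `None`. The `matches_empty` helper is a hypothetical stand-in for the selector `[foo=""]`; it is not how Soup Sieve or any browser implements matching.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect the attribute dict of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports a bare attribute (`foo`) with the value None.
        self.seen.append(dict(attrs))


def matches_empty(attrs, name):
    """Hypothetical stand-in for the CSS selector `[name=""]`."""
    if name not in attrs:
        return False
    value = attrs[name]
    # Treat a valueless attribute as an implied empty string.
    return (value if value is not None else "") == ""


collector = AttrCollector()
collector.feed('<p foo></p><p foo=""></p><p foo="x"></p><p></p>')
collector.close()

results = [matches_empty(attrs, "foo") for attrs in collector.seen]
print(results)  # [True, True, False, False]
```

Both `foo` and `foo=""` match once the missing value is normalized to an empty string, which is the behavior described above for browsers.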

The whole reason the Soup Sieve issue was opened is that the user, in order to force BS to output a bare attribute in the form of `foo`, had to assign the attribute a value of `None`. Interestingly, when BS imports a bare attribute, it stores it in its dictionary as `attrs['foo'] = ''`, which I think is correct. But when outputting, it will not output it as `foo`; it will output it as `foo=""`. That is why the user set it to `None`: BS treats that differently and outputs a bare attribute `foo`.

Why are these (an attribute value of `None` vs. an empty string) treated differently in output? In HTML they are the same, so I would expect an attribute assigned an empty string or `None` to output the same. Generally, most HTML users would probably prefer the bare attribute. So can we have the HTML formatter, by default, output bare attributes when the string is empty?
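To make the asymmetry concrete, here is a toy model of the output rule as described in this report. `render_tag` is a hypothetical helper written for illustration; it is not Beautiful Soup's code and it skips escaping entirely.

```python
def render_tag(name, attrs):
    """Render an opening tag the way this report describes the output:
    None becomes a bare attribute, anything else becomes key="value"."""
    parts = [name]
    for key, value in attrs.items():
        if value is None:
            parts.append(key)  # bare attribute: foo
        else:
            parts.append(f'{key}="{value}"')  # explicit value: foo=""
    return "<" + " ".join(parts) + ">"


from_empty = render_tag("input", {"disabled": ""})
from_none = render_tag("input", {"disabled": None})
print(from_empty)  # <input disabled="">
print(from_none)   # <input disabled>
```

The two inputs mean the same thing in HTML, yet only the `None` case produces the bare form, which is the inconsistency in question.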

Isaac Muse (facelessuser) wrote :

I do realize that XML should treat this case differently, and this would only be for HTML.

Isaac Muse (facelessuser) wrote :

Digging into this more, it seems if we wanted to make empty strings be bare attributes in just HTML, the HTML formatter(s) could be adjusted to return a None if the value is an empty string as Element.decode seems to format None values as bare attributes (though I don't think Beautiful Soup ever actually assigns None to an attribute itself).

I guess alternatively, for consistency, you could also do the opposite and just change Element.decode to output an attribute with a None value as an empty string `attr=""`.

For me, I'm just confused by the inconsistency of `None` vs. an empty string, as they should be treated the same (at least in my mind), whether that means turning both cases into bare attributes in HTML or always outputting them in XML style. It threw me for a loop that people would go out of their way to replace empty-string attribute values with `None` just to get a bare attribute.

In the end, I realize this is cosmetic (even if it tweaks some OCD part of my brain). I'm happy to provide a patch if we'd like to change the behavior, but I'll live if no change is desired as well :).

Leonard Richardson (leonardr) wrote :

My gut feeling is that this logic dates to a very old version of Beautiful Soup when I had a lot of poorly-thought-out beliefs about HTML. Possibly so old that HTML itself was different.

I'm fine with this change; my only concern is backwards compatibility with existing toolchains that use Beautiful Soup.

Let's add an argument to the Formatter constructor that controls how bare attributes are output. I think it's reasonable to give this new behavior to the 'html5' formatter, leave the default 'html' formatter alone, put this in a feature release, and see if anyone complains.
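A minimal sketch of what such a constructor argument could look like, assuming a simplified formatter that only handles attributes. `SketchFormatter`, `empty_as_boolean`, and the `FORMATTERS` registry are hypothetical names chosen for illustration; they are not taken from the Beautiful Soup source.

```python
class SketchFormatter:
    """Hypothetical formatter with an opt-in flag for bare attributes."""

    def __init__(self, empty_as_boolean=False):
        # Off by default so existing output stays unchanged.
        self.empty_as_boolean = empty_as_boolean

    def attributes(self, attrs):
        """Yield (key, value) pairs, mapping '' to None when the flag is set."""
        for key in sorted(attrs):
            value = attrs[key]
            if self.empty_as_boolean and value == "":
                value = None
            yield key, value

    def render(self, name, attrs):
        parts = [name]
        for key, value in self.attributes(attrs):
            # None is rendered as a bare attribute, as in the existing output rule.
            parts.append(key if value is None else f'{key}="{value}"')
        return "<" + " ".join(parts) + ">"


# Only the 'html5' entry opts in; 'html' keeps the old behavior.
FORMATTERS = {
    "html": SketchFormatter(),
    "html5": SketchFormatter(empty_as_boolean=True),
}

attrs = {"type": "checkbox", "checked": ""}
html_out = FORMATTERS["html"].render("input", attrs)
html5_out = FORMATTERS["html5"].render("input", attrs)
print(html_out)   # <input checked="" type="checkbox">
print(html5_out)  # <input checked type="checkbox">
```

Keeping the flag off by default and enabling it for only one named formatter limits the backwards-compatibility risk to users who explicitly choose that formatter.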

Changed in beautifulsoup:
status: New → Confirmed
summary: - Handling of bare attributes in HTML
+ Handling attributes whose value is the empty string as HTML boolean attributes

Leonard Richardson (leonardr) wrote :

Merge proposal adapted into revision 601.

Changed in beautifulsoup:
status: Confirmed → Fix Committed

Leonard Richardson (leonardr) wrote :

Released in 4.10.0.

Changed in beautifulsoup:
status: Fix Committed → Fix Released