Cannot prettify soup with copied elements

Bug #1838903 reported by Rob Brackett
This bug affects 2 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have a project that uses `copy.copy()` extensively to make copies of elements in a soup (it diffs HTML documents). After updating to Beautiful Soup v4.8.0, I can no longer call `prettify()` on the soup instances with copied elements in them:

```py
from bs4 import BeautifulSoup
import copy

html = '''<!doctype html>
<html>
  <head><title>Hello BS4</title></head>
  <body>
    <p>Here is some text.</p>
  </body>
</html>
'''

# Make a soup
soup = BeautifulSoup(html, 'lxml')

# Add a copied element
paragraph2 = copy.copy(soup.p)
paragraph2.string.replace_with('Here is more text')
soup.body.append(paragraph2)

# Prettifying now raises an error in:
# bs4/element.py in _should_pretty_print()
soup.prettify()
```

`_should_pretty_print()` winds up raising `TypeError: argument of type 'NoneType' is not iterable` because `copy()` causes the `Tag` instance to be created with no builder, and Tags with no builder never get their `preserve_whitespace_tags` attribute set: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L767

Later on in `_should_pretty_print()` (and in several other spots), Beautiful Soup assumes `self.preserve_whitespace_tags` is an iterable, but in this case, it’s `None`.
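
The failure mode can be illustrated without Beautiful Soup at all. The following is a toy model, not the library's code: `ToyTag`, its `builder` argument, and the tag names are made up for illustration. Only the attribute name `preserve_whitespace_tags` and the method name `_should_pretty_print()` come from the description above.

```python
import copy


class ToyTag:
    """Toy stand-in for a tag; NOT Beautiful Soup's real Tag class."""

    def __init__(self, name, builder=None):
        self.name = name
        # Mirrors the behaviour described above: the attribute is only
        # populated when a builder is supplied.
        if builder is None:
            self.preserve_whitespace_tags = None
        else:
            self.preserve_whitespace_tags = builder["preserve_whitespace_tags"]

    def __copy__(self):
        # The copy is constructed without a builder, so the setting is lost.
        return ToyTag(self.name)

    def _should_pretty_print(self):
        # Assumes the attribute is always a container.
        return self.name not in self.preserve_whitespace_tags


# Hypothetical builder data, for illustration only.
toy_builder = {"preserve_whitespace_tags": {"pre", "textarea"}}
original = ToyTag("p", builder=toy_builder)
clone = copy.copy(original)

# Works: the original's attribute is a set.
original_result = original._should_pretty_print()

try:
    clone._should_pretty_print()
    clone_error = None
except TypeError as error:
    # The membership test against None is what fails.
    clone_error = error
```

In this sketch the original tag answers normally, while the copy hits the same kind of `TypeError` because its attribute is `None`.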

It looks like this changed in rev 506: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/506

I use element copying extensively in an HTML diffing tool I work on in order to wrap changed parts of the document in tags indicating whether they were inserted or deleted, and in wrapping some `<style>` and `<script>` tags in `<template>` tags in order to delete them.

Rob Brackett (mr0grog)
description: updated
Rob Brackett (mr0grog) wrote :

Hi, any updates on this? Is there a good workaround for the regression?

Leonard Richardson (leonardr) wrote :

I've fixed this problem in two ways. As of revision 520, copies of tags propagate information originally obtained from the TreeBuilder, so .preserve_whitespace_tags won't be None. As of revision 518, _should_pretty_print() will function even if Tag.preserve_whitespace_tags _is_ None.
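
As a rough illustration of those two changes, here is a toy sketch. It is not the code from revisions 518 or 520: `ToyTag` and its constructor are hypothetical, and only the attribute and method names are taken from the discussion above.

```python
import copy


class ToyTag:
    """Toy stand-in for a tag; NOT Beautiful Soup's real Tag class."""

    def __init__(self, name, preserve_whitespace_tags=None):
        self.name = name
        self.preserve_whitespace_tags = preserve_whitespace_tags

    def __copy__(self):
        # First change (sketch): carry the builder-derived setting to the copy.
        return ToyTag(self.name, self.preserve_whitespace_tags)

    def _should_pretty_print(self):
        # Second change (sketch): a missing setting means nothing is preserved.
        preserved = self.preserve_whitespace_tags
        return preserved is None or self.name not in preserved


pre = ToyTag("pre", preserve_whitespace_tags={"pre", "textarea"})
pre_copy = copy.copy(pre)  # keeps the setting, so <pre> is still not reformatted
bare = ToyTag("p")         # no setting at all, like the copies in the report
```

Either change alone would avoid the error in this toy model; together they also keep copies behaving like their originals.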

I don't see a good way to work around this without updating the code, so in revision 521 I made it possible to pass replacements for Tag and NavigableString into the BeautifulSoup constructor. That will make future workarounds easier.
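
For reference, a minimal sketch of what passing replacement classes can look like. It assumes a Beautiful Soup release that includes this change, and that the constructor argument is the `element_classes` dictionary described in the Beautiful Soup documentation. `PatchedTag` and `PatchedString` are hypothetical names, and the subclasses are left empty on purpose.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


class PatchedTag(Tag):
    """Hypothetical subclass; a workaround would override methods here."""


class PatchedString(NavigableString):
    """Hypothetical subclass, included only to show the mapping."""


markup = "<p>Here is some text.</p>"

# Map the stock classes to their replacements
# (assumes the `element_classes` constructor argument is available).
soup = BeautifulSoup(
    markup,
    "html.parser",
    element_classes={Tag: PatchedTag, NavigableString: PatchedString},
)

tag_is_patched = isinstance(soup.p, PatchedTag)
string_is_patched = isinstance(soup.p.string, PatchedString)
```

With the mapping in place, the parser builds the tree out of the substitute classes, so a future fix could live in a subclass instead of requiring a patched copy of the library.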

Changed in beautifulsoup:
status: New → Fix Committed
Rob Brackett (mr0grog) wrote :

Thanks! It passes my unit tests now, at least :)

I’ll post back here if I see any related issues while testing it out tomorrow.

Rob Brackett (mr0grog) wrote :

I'm not seeing any related or follow-on issues after updating to the latest commit here. Thanks so much!

Changed in beautifulsoup:
status: Fix Committed → Fix Released