Comment 7 for bug 682685

Leonard Richardson (leonardr) wrote:

I still can't reproduce this problem, but if it's memory-related, that would explain the variability between systems.

Your patch is very tempting, but your __deepcopy__ implementation won't work, because a NavigableString's components are not immutable. Consider this code:

---
from bs4 import BeautifulSoup
import copy

soup = BeautifulSoup("<a>foo</a><b>bar</b>")
soup_2 = copy.deepcopy(soup)

# With the patched __deepcopy__, these are the same object.
foo = soup.find(text="foo")
foo_2 = soup_2.find(text="foo")

# Move the "bar" string so it immediately precedes "foo".
foo.insert_before(soup.find(text="bar"))

print(soup)    # "bar" has moved inside soup's <a> tag
print(soup_2)  # superficially looks unchanged

print(foo.parent)    # the <a> tag in soup
print(foo_2.parent)  # the very same <a> tag in soup, not the one in soup_2
---

Superficially, soup_2 looks fine, but it's in an internally inconsistent state, because foo and foo_2 are the same object. They always have the same parent, and that parent is always an object in soup, never in soup_2.
When insert_before() moves the "bar" string, it updates the <a> tag and the <b> tag in soup, because those are the string's new and old parents. But the <a> tag and the <b> tag in soup_2 never get the message.

I can believe that the default deepcopy algorithm is inefficient enough to exceed the maximum recursion depth on a densely interconnected data structure like a Beautiful Soup parse tree. But I don't understand the algorithm well enough to fix it within Beautiful Soup, and disabling it would cause subtle problems that can't be fixed. The current problem, by contrast, is blatant, and it can usually be worked around by extracting data from the parse tree before using it outside of Beautiful Soup.
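
That workaround looks something like this: once a NavigableString is converted to a plain string, it no longer holds references into the parse tree, so copying it is trivial. (Under Python 2, unicode() plays the role of str() here.)

---
from bs4 import BeautifulSoup
import copy

soup = BeautifulSoup("<a>foo</a><b>bar</b>")

# A NavigableString keeps references into the parse tree, so
# deep-copying it means deep-copying the whole tree.
foo = soup.find(text="foo")

# A plain string carries no tree references; copying it is cheap
# and never recurses into the tree.
foo_text = str(foo)
foo_text_2 = copy.deepcopy(foo_text)
---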

I don't see a problem with your __copy__ implementation, because the string part of a NavigableString _is_ immutable. I'll be adding that to Beautiful Soup. Hopefully that will help the default deepcopy algorithm run better.
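
Concretely, the idea is something like the following sketch (shown here as a monkeypatch for illustration; the actual change would go on the NavigableString class itself, and the code that lands in Beautiful Soup may differ):

---
from bs4.element import NavigableString

def __copy__(self):
    # The string value itself is immutable, so a copy can simply be
    # a brand-new NavigableString with the same text and no
    # connections to the original parse tree.
    return type(self)(self)

NavigableString.__copy__ = __copy__
---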