Large number of DBs left around after test suite runs in buildbot

Bug #687951 reported by Tom Haddon
This bug affects 1 person
Affects                           Status        Importance  Assigned to    Milestone
Launchpad Buildbot Configuration  Fix Released  High        Stuart Bishop
zope.testing                      Won't Fix     Undecided   Unassigned
zope.testrunner                   Invalid       Undecided   Unassigned

Bug Description

We're getting disk space alerts on pilinut, and looking at where the disk space is being used we came across the following: https://pastebin.canonical.com/40714/

It seems the test suite doesn't clean up DBs after itself.

Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Tom Haddon (mthaddon) wrote :

I've temporarily worked around this by stopping buildbot, dropping all the DBs, and then starting it up again.

Tom Haddon (mthaddon) wrote :

Marking this as high because it's becoming an operational issue - buildbot slaves are running low on free disk space and alerting us regularly.

Changed in lpbuildbot:
importance: Undecided → High
Gary Poster (gary)
Changed in lpbuildbot:
status: New → Triaged
assignee: nobody → Gary Poster (gary)
Gary Poster (gary) wrote :

Edited IRC log conversation with lifeless.

gary: lifeless, if you are still around, can you point me to the code that makes the launchpad_ftest_NNN (e.g., launchpad_ftest_10452) databases for tests? I've gone a-hunting and have not found it yet--or at least not recognized it
lifeless: gary: its the PgTestSetup stuff, now DatabaseFixture IIRC
lifeless: gary: to find it, start with layers.py
lifeless: gary: look at DatabaseLayer - you can see the fixture class it uses, and track back from there
lifeless: there is an lp specific thing in the same module that subclasses the more generic one
lifeless: the cause of the leakage will be one of a couple of things
lifeless: (I think)
lifeless: a) We're shutting down test processes via a signal that we don't trap
lifeless: solution : add a handler for that that will cause a stack unwind; stack unwinds should pretty reliably clean up
lifeless: b) Something is preventing the drop db happening - e.g. an appserver test process is slow to shutdown, we're not waiting for it to really shutdown in layer cleanup, and dropdb is thus failing because the db is in use
lifeless: c) there is a remaining bug in zope.testing causing our layer teardowns to not consistently happen
lifeless: I don't have a strong feeling for probability
gary: a) solution: stack unwinding: e.g., just raising a normal exception?
lifeless: fixing the functionallayer to use zope.testing.functional or whatever it is which (now) has a teardown function, to make the layer teardownable, is probably all thats needed
lifeless: gary: yeah, like test.in does
gary: gotcha, -ish
lifeless: gary: it may be missing a signal that we can expect
lifeless: I filed a bug about stale librarian pids
lifeless: in buildbot
lifeless: similar set of scenarios
lifeless: http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/annotate/head:/buildout-templates/bin/test.in#L75
lifeless: e.g. if we're also receiving HUP in the test environment
lifeless: then we need to add signal.signal(signal.SIGHUP, exit_with_atexit_handlers)
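
Scenario (a) above can be illustrated with a minimal sketch. The name exit_with_atexit_handlers comes from the log; its body here is an assumption, not the code in test.in, and install_handlers and the exit code are hypothetical. The idea is only that turning a signal into SystemExit unwinds the stack and lets atexit handlers run.

```python
import signal
import sys


def exit_with_atexit_handlers(signum, frame):
    """Turn a termination signal into a normal interpreter exit.

    sys.exit() raises SystemExit, which unwinds the stack (running
    finally blocks) and lets registered atexit handlers run.
    """
    # Assumption: 128 + signal number is used as the exit status,
    # following the common shell convention.
    sys.exit(128 + signum)


def install_handlers(names=("SIGTERM", "SIGHUP")):
    """Hypothetical helper: route the named signals to the handler.

    Signals missing on the current platform (e.g. SIGHUP on Windows)
    are skipped. Must be called from the main thread.
    """
    for name in names:
        signum = getattr(signal, name, None)
        if signum is not None:
            signal.signal(signum, exit_with_atexit_handlers)


if __name__ == "__main__":
    install_handlers()
```

Which signals buildbot actually sends to the test process is not established in this report, so the list passed to install_handlers would need to be checked against the real environment.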

Gary Poster (gary) wrote :

I have done an expedient solution to the immediate problem: the beginning of test_on_merge removes the irrelevant databases. This solves the buildbot problem, but there is obviously an underlying problem in Launchpad. I will create another bug for that problem.
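
As a rough illustration of that kind of cleanup (not the code that was committed), the sketch below picks out leftover databases by name. It assumes they all follow the launchpad_ftest_NNN pattern mentioned earlier in this report; the function names are hypothetical, and actually executing the statements against PostgreSQL is left out.

```python
import re

# Assumption: leftover per-run test databases are named
# launchpad_ftest_NNN, as in the example launchpad_ftest_10452.
TEST_DB_PATTERN = re.compile(r"^launchpad_ftest_\d+$")


def leftover_test_databases(database_names):
    """Return the names that look like per-run test databases, sorted."""
    return sorted(n for n in database_names if TEST_DB_PATTERN.match(n))


def drop_statements(database_names):
    """Build one DROP DATABASE statement per leftover test database.

    Matching names contain only letters, digits and underscores, so
    quoting them as identifiers is safe.
    """
    return [
        'DROP DATABASE "%s"' % name
        for name in leftover_test_databases(database_names)
    ]
```

A cleanup like this is only safe when no other test run is using the same PostgreSQL cluster, since it cannot tell a stale database from one that is in use.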

Gary Poster (gary) wrote :

lifeless rejected the expedient solution. mthaddon reports that we can postpone working on this till we return in 2011.

Changed in lpbuildbot:
assignee: Gary Poster (gary) → nobody
Stuart Bishop (stub)
Changed in lpbuildbot:
assignee: nobody → Stuart Bishop (stub)
Stuart Bishop (stub) wrote :

Garbage databases are easy enough to create by hitting Ctrl-C while running the test suite. The subprocess currently running tests aborts, possibly after calling cleanup, but the parent test process then barfs and certainly doesn't call any cleanup:

Traceback (most recent call last):
  File "bin/test", line 275, in <module>
      Set up canonical.testing.layers.FunctionalLayer
testrunner.run([])
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/__init__.py", line 32, in run
    failed = run_internal(defaults, args, script_parts=script_parts)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/__init__.py", line 45, in run_internal
    runner.run()
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 138, in run
    self.run_tests()
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 238, in run_tests
    layers_to_run, self.failures, self.errors)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 587, in resume_tests
    time.sleep(0.01) # Keep the loop from being too tight.
KeyboardInterrupt
SIGINT handled.
Traceback (most recent call last):
  File "bin/test", line 275, in <module>
    testrunner.run([])
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/__init__.py", line 32, in run
    failed = run_internal(defaults, args, script_parts=script_parts)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/__init__.py", line 45, in run_internal
    runner.run()
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 138, in run
    self.run_tests()
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 219, in run_tests
    setup_layers, self.failures, self.errors)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 367, in run_layer
    setup_layer(options, layer, setup_layers)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 643, in setup_layer
    setup_layer(options, base, setup_layers)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.testing-3.9.4_p3-py2.6.egg/zope/testing/testrunner/runner.py", line 648, in setup_layer
    layer.setUp()
  File "/home/stub/lp/trivial/lib/canonical/testing/profiled.py", line 29, in profiled_func
    return func(cls, *args, **kw)
  File "/home/stub/lp/trivial/lib/canonical/testing/layers.py", line 1042, in setUp
    FunctionalTestSetup().setUp()
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.app.testing-3.7.5-py2.6.egg/zope/app/testing/functional.py", line 215, in __init__
    self.app = Debugger(self.db, config_file)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.app.debug-3.4.1-py2.6.egg/zope/app/debug/debug.py", line 36, in __init__
    config(config_file)
  File "/home/stub/lp/lp-sourcedeps/eggs/zope.app.app...

Stuart Bishop (stub) wrote :

I think the core bugs are in zope.testing. From the traceback above, it is obvious that if tests are being run in a subprocess and that subprocess aborts, then the parent process doesn't cope. The tracebacks are there though so this might be acceptable. The other bug is that there are cases where layer tear down is not invoked which could be handled.

It looks like we can work around the immediate issue in Launchpad with garbage databases being left around by installing the teardown as an atexit handler (in addition to doing it in the layer tear down).
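
A minimal sketch of that workaround, assuming Python 3 and a hypothetical fixture class (the real Launchpad fixture is not shown in this report): tearDown is made idempotent so it can be called both from the layer tear down and from atexit without dropping the database twice.

```python
import atexit


class DatabaseFixtureSketch:
    """Hypothetical stand-in for a fixture owning a per-run test database."""

    def __init__(self, drop_database):
        # drop_database is an assumed callable that drops the database.
        self._drop_database = drop_database
        self._is_set_up = False

    def setUp(self):
        self._is_set_up = True
        # Safety net: also run tearDown at interpreter exit in case the
        # test runner never calls the layer tear down.
        atexit.register(self.tearDown)

    def tearDown(self):
        if not self._is_set_up:
            return  # Already torn down; nothing to do.
        self._is_set_up = False
        self._drop_database()
        # Drop the safety net once the normal path has succeeded.
        atexit.unregister(self.tearDown)
```

Note that atexit handlers do not run when the process is killed by an unhandled signal, so this only covers exits that go through normal interpreter shutdown.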

Robert Collins (lifeless) wrote : Re: [Bug 687951] Re: Large number of DBs left around after test suite runs in buildbot

I'd -really- like us to fix the root issues here. If zope.testing is
still giving us grief, we need to fix that; otherwise every fixture we
add will need the same workaround. The previous atexit hooks masked
serious problems leading to leaked processes, and I'd hate to have to
analyze that again.

Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
tags: added: qa-needstesting
Changed in lpbuildbot:
status: Triaged → Fix Committed
j.c.sackett (jcsackett)
tags: added: qa-untestable
removed: qa-needstesting
Stuart Bishop (stub)
Changed in lpbuildbot:
status: Fix Committed → Fix Released
Marius Gedminas (mgedmin) wrote :

This needs to be fixed in zope.testrunner, not zope.testing.

Changed in zope.testing:
status: New → Won't Fix
Colin Watson (cjwatson) wrote :

The zope.testrunner project on Launchpad has been archived at the request of the Zope developers (see https://answers.launchpad.net/launchpad/+question/683589 and https://answers.launchpad.net/launchpad/+question/685285). If this bug is still relevant, please refile it at https://github.com/zopefoundation/zope.testrunner.

Changed in zope.testrunner:
status: New → Invalid