Merge lp:~stub/launchpad/branch-rewrite into lp:launchpad

Proposed by Stuart Bishop
Status: Merged
Approved by: Stuart Bishop
Approved revision: no longer in the source branch.
Merged at revision: 13863
Proposed branch: lp:~stub/launchpad/branch-rewrite
Merge into: lp:launchpad
Prerequisite: lp:~stub/launchpad/pgbouncer-fixture-noca
Diff against target: 284 lines (+132/-12)
6 files modified
lib/canonical/config/tests/test_database_config.py (+2/-1)
lib/canonical/launchpad/doc/canonical-config.txt (+2/-1)
lib/lp/codehosting/tests/test_rewrite.py (+89/-7)
lib/lp/testing/__init__.py (+24/-0)
lib/lp/testing/tests/test_pgsql.py (+1/-1)
scripts/branch-rewrite.py (+14/-2)
To merge this branch: bzr merge lp:~stub/launchpad/branch-rewrite
Reviewer Review Type Date Requested Status
Jeroen T. Vermeulen (community) Approve
Review via email: mp+73563@code.launchpad.net

This proposal supersedes a proposal from 2011-08-29.

Commit message

[r=jtv][bug=836662] Make branch-rewrite.py survive database outages

Description of the change

= Summary =

branch-rewrite.py does not reconnect after a database outage.

== Proposed fix ==

Make it reconnect after a database outage.

To post a comment you must log in.
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote : Posted in a previous version of this proposal
Download full text (5.1 KiB)

Hi Stuart,

Generally good, but I have a few things that look worth fixing.

=== modified file 'lib/lp/codehosting/tests/test_rewrite.py'
--- lib/lp/codehosting/tests/test_rewrite.py 2011-08-12 11:37:08 +0000
+++ lib/lp/codehosting/tests/test_rewrite.py 2011-08-29 20:15:59 +0000

@@ -177,7 +181,8 @@
         transaction.commit()
         rewriter.rewriteLine('/' + branch.unique_name + '/.bzr/README')
         rewriter.rewriteLine('/' + branch.unique_name + '/.bzr/README')
- logging_output_lines = self.getLoggerOutput(rewriter).strip().split('\n')
+ logging_output_lines = self.getLoggerOutput(
+ rewriter).strip().split('\n')

Cleaning out lint, I see. Thanks.

@@ -274,3 +280,62 @@
         # The script produces logging output, but not to stderr.
         self.assertEqual('', err)
         self.assertEqual(expected_lines, output_lines)
+
+
+class TestBranchRewriterScriptHandlesDisconnects(TestCaseWithFactory):
+ """Ensure branch-rewrite.py survives fastdowntime deploys."""
+ layer = LaunchpadScriptLayer
+
+ def setUp(self):
+ super(TestBranchRewriterScriptHandlesDisconnects, self).setUp()
+ self.pgbouncer = PGBouncerFixture()
+ self.addCleanup(self.pgbouncer.cleanUp)
+ self.pgbouncer.setUp()

Couldn't those last three lines be replaced with a simple

    self.pgbouncer = self.useFixture(PGBouncerFixture())

?

In fact I'm not even sure it's worth a setUp with the super() dance, and carrying self.pgbouncer from setUp to the tests.

+ def spawn(self):
+ script_file = os.path.join(
+ config.root, 'scripts', 'branch-rewrite.py')
+
+ self.rewriter_proc = subprocess.Popen(
+ [script_file], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
+ stderr=subprocess.PIPE, bufsize=0)

Again, is it worth keeping self.rewriter_proc on the test class? It just hides state. Why not make it a return value?

Moreover, is this process guaranteed to clean itself up? If not, then please ensure that it is. For instance, you could add something like "self.addCleanup(kill, rewriter_proc)" to spawn().

+ def request(self, query):
+ self.rewriter_proc.stdin.write(query + '\n')
+ return self.rewriter_proc.stdout.readline().rstrip('\n')

Do we know that the readline() will never hang under reasonable circumstances?

+ def test_disconnect(self):
+ self.spawn()

Sure "disconnect" is something that happens in this test, but is that one word really a good description of what you're testing here? Would it help to say something with "reconnects"?

+ # Everything should be working, and we get valid output.
+ out = self.request('foo')
+ assert out.endswith('/foo'), out

Why use assert here? Normally you'd say self.assertEndsWith(out, '/foo').

+ self.pgbouncer.stop()
+
+ # Now with pgbouncer down, we should get NULL messages and
+ # stderr spam, and this keeps happening. We test more than
+ # once to ensure that we will keep trying to reconnect even
+ # after several failures.
+ for count in range(5):
+ out = self.request('foo')

Is this race-free? I don't know ...

Read more...

review: Approve
Revision history for this message
Stuart Bishop (stub) wrote : Posted in a previous version of this proposal
Download full text (4.7 KiB)

On Wed, Aug 31, 2011 at 10:26 AM, Jeroen T. Vermeulen <email address hidden> wrote:
> === modified file 'lib/lp/codehosting/tests/test_rewrite.py'
> +class TestBranchRewriterScriptHandlesDisconnects(TestCaseWithFactory):
> +    """Ensure branch-rewrite.py survives fastdowntime deploys."""
> +    layer = LaunchpadScriptLayer
> +
> +    def setUp(self):
> +        super(TestBranchRewriterScriptHandlesDisconnects, self).setUp()
> +        self.pgbouncer = PGBouncerFixture()
> +        self.addCleanup(self.pgbouncer.cleanUp)
> +        self.pgbouncer.setUp()
>
> Couldn't those last three lines be replaced with a simple
>
>    self.pgbouncer = self.useFixture(PGBouncerFixture())
>
> ?
>
> In fact I'm not even sure it's worth a setUp with the super() dance, and carrying self.pgbouncer from setUp to the tests.

Yes, useFixture is nicer here. I've removed setUp() entirely and am
using useFixture in the tests.

> +    def spawn(self):
> +        script_file = os.path.join(
> +            config.root, 'scripts', 'branch-rewrite.py')
> +
> +        self.rewriter_proc = subprocess.Popen(
> +            [script_file], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
> +            stderr=subprocess.PIPE, bufsize=0)
>
> Again, is it worth keeping self.rewriter_proc on the test class?  It just hides state.  Why not make it a return value?

If I don't hide state, I have to maintain state and pass it to the
request() helper :-)

> Moreover, is this process guaranteed to clean itself up?  If not, then please ensure that it is.  For instance, you could add something like "self.addCleanup(kill, rewriter_proc)" to spawn().

subprocess module handles this for us as soon as things go out of
scope. But it doesn't hurt to be explicit so I've added the cleanup.

> +    def request(self, query):
> +        self.rewriter_proc.stdin.write(query + '\n')
> +        return self.rewriter_proc.stdout.readline().rstrip('\n')
>
> Do we know that the readline() will never hang under reasonable circumstances?

> +    def test_disconnect(self):
> +        self.spawn()
>
> Sure "disconnect" is something that happens in this test, but is that one word really a good description of what you're testing here?  Would it help to say something with "reconnects"?

I've change the test name to test_reconnects_when_disconnected.

> +        # Everything should be working, and we get valid output.
> +        out = self.request('foo')
> +        assert out.endswith('/foo'), out
>
> Why use assert here?  Normally you'd say self.assertEndsWith(out, '/foo').

Because I'm always forgetting which assert methods are approved,
standard, testutils extensions, LP extensions :-)

Fixed, along with the other occurrences.

> +        self.pgbouncer.stop()
> +
> +        # Now with pgbouncer down, we should get NULL messages and
> +        # stderr spam, and this keeps happening. We test more than
> +        # once to ensure that we will keep trying to reconnect even
> +        # after several failures.
> +        for count in range(5):
> +            out = self.request('foo')
>
> Is this race-free?  I don't know if pgbouncer is a separate process or not; if it is, does stop() block until it's taken effect...

Read more...

Revision history for this message
Stuart Bishop (stub) wrote : Posted in a previous version of this proposal

readline() blocking shouldn't be a problem, but I've implemented a nonblocking readline helper and made use of it.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote : Posted in a previous version of this proposal

I don't know enough to be of much help with the raise-or-swallow issue. What the best choice is there may depend on what goes on higher up in the call tree. If any kind of transactional integrity is expected across the failure, then the exception probably needs to be re-raised.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Thanks for taking the trouble.

A tip for nonblocking_readline: the time calculations get simpler if you introduce a variable “deadline”:

 * deadline = start + timeout
 * “now < start - timeout” becomes “now < deadline”
 * “timeout - (now - start)” becomes “deadline - now”

(Both in their way involve computing the time remaining, but I'm not sure that making that explicit as well makes things any better.)

Jeroen

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/canonical/config/tests/test_database_config.py'
2--- lib/canonical/config/tests/test_database_config.py 2010-10-17 02:38:59 +0000
3+++ lib/canonical/config/tests/test_database_config.py 2011-09-03 05:39:35 +0000
4@@ -28,7 +28,8 @@
5 self.assertEquals('librarian', config.librarian.dbuser)
6
7 dbconfig.setConfigSection('librarian')
8- expected_db = 'dbname=%s' % DatabaseLayer._db_fixture.dbname
9+ expected_db = (
10+ 'dbname=%s host=localhost' % DatabaseLayer._db_fixture.dbname)
11 self.assertEquals(expected_db, dbconfig.rw_main_master)
12 self.assertEquals('librarian', dbconfig.dbuser)
13
14
15=== modified file 'lib/canonical/launchpad/doc/canonical-config.txt'
16--- lib/canonical/launchpad/doc/canonical-config.txt 2010-12-22 14:50:08 +0000
17+++ lib/canonical/launchpad/doc/canonical-config.txt 2011-09-03 05:39:35 +0000
18@@ -15,7 +15,8 @@
19
20 >>> from canonical.config import config
21 >>> from canonical.testing.layers import DatabaseLayer
22- >>> expected = 'dbname=%s' % DatabaseLayer._db_fixture.dbname
23+ >>> expected = (
24+ ... 'dbname=%s host=localhost' % DatabaseLayer._db_fixture.dbname)
25 >>> expected == config.database.rw_main_master
26 True
27 >>> config.database.db_statement_timeout is None
28
29=== modified file 'lib/lp/codehosting/tests/test_rewrite.py'
30--- lib/lp/codehosting/tests/test_rewrite.py 2011-08-12 11:37:08 +0000
31+++ lib/lp/codehosting/tests/test_rewrite.py 2011-09-03 05:39:35 +0000
32@@ -1,4 +1,4 @@
33-# Copyright 2009 Canonical Ltd. This software is licensed under the
34+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
35 # GNU Affero General Public License version 3 (see the file LICENSE).
36
37 """Tests for the dynamic RewriteMap used to serve branches over HTTP."""
38@@ -14,15 +14,21 @@
39 from zope.security.proxy import removeSecurityProxy
40
41 from canonical.config import config
42-from canonical.testing.layers import DatabaseFunctionalLayer
43+from canonical.testing.layers import (
44+ DatabaseFunctionalLayer,
45+ DatabaseLayer,
46+ )
47 from lp.code.interfaces.codehosting import branch_id_alias
48 from lp.codehosting.rewrite import BranchRewriter
49 from lp.codehosting.vfs import branch_id_to_path
50 from lp.services.log.logger import BufferLogger
51 from lp.testing import (
52 FakeTime,
53+ nonblocking_readline,
54+ TestCase,
55 TestCaseWithFactory,
56 )
57+from lp.testing.fixture import PGBouncerFixture
58
59
60 class TestBranchRewriter(TestCaseWithFactory):
61@@ -177,7 +183,8 @@
62 transaction.commit()
63 rewriter.rewriteLine('/' + branch.unique_name + '/.bzr/README')
64 rewriter.rewriteLine('/' + branch.unique_name + '/.bzr/README')
65- logging_output_lines = self.getLoggerOutput(rewriter).strip().split('\n')
66+ logging_output_lines = self.getLoggerOutput(
67+ rewriter).strip().split('\n')
68 self.assertEqual(2, len(logging_output_lines))
69 self.assertIsNot(
70 None,
71@@ -194,7 +201,8 @@
72 self.fake_time.advance(
73 config.codehosting.branch_rewrite_cache_lifetime + 1)
74 rewriter.rewriteLine('/' + branch.unique_name + '/.bzr/README')
75- logging_output_lines = self.getLoggerOutput(rewriter).strip().split('\n')
76+ logging_output_lines = self.getLoggerOutput(
77+ rewriter).strip().split('\n')
78 self.assertEqual(2, len(logging_output_lines))
79 self.assertIsNot(
80 None,
81@@ -246,7 +254,8 @@
82 # buffering, write a complete line of output.
83 for input_line in input_lines:
84 proc.stdin.write(input_line + '\n')
85- output_lines.append(proc.stdout.readline().rstrip('\n'))
86+ output_lines.append(
87+ nonblocking_readline(proc.stdout, 60).rstrip('\n'))
88 # If we create a new branch after the branch-rewrite.py script has
89 # connected to the database, or edit a branch name that has already
90 # been rewritten, both are rewritten successfully.
91@@ -260,17 +269,90 @@
92 'file:///var/tmp/bazaar.launchpad.dev/mirrors/%s/.bzr/README'
93 % branch_id_to_path(new_branch.id))
94 proc.stdin.write(new_branch_input + '\n')
95- output_lines.append(proc.stdout.readline().rstrip('\n'))
96+ output_lines.append(
97+ nonblocking_readline(proc.stdout, 60).rstrip('\n'))
98
99 edited_branch_input = '/%s/.bzr/README' % edited_branch.unique_name
100 expected_lines.append(
101 'file:///var/tmp/bazaar.launchpad.dev/mirrors/%s/.bzr/README'
102 % branch_id_to_path(edited_branch.id))
103 proc.stdin.write(edited_branch_input + '\n')
104- output_lines.append(proc.stdout.readline().rstrip('\n'))
105+ output_lines.append(
106+ nonblocking_readline(proc.stdout, 60).rstrip('\n'))
107
108 os.kill(proc.pid, signal.SIGINT)
109 err = proc.stderr.read()
110 # The script produces logging output, but not to stderr.
111 self.assertEqual('', err)
112 self.assertEqual(expected_lines, output_lines)
113+
114+
115+class TestBranchRewriterScriptHandlesDisconnects(TestCase):
116+ """Ensure branch-rewrite.py survives fastdowntime deploys."""
117+ layer = DatabaseLayer
118+
119+ def spawn(self):
120+ script_file = os.path.join(
121+ config.root, 'scripts', 'branch-rewrite.py')
122+
123+ self.rewriter_proc = subprocess.Popen(
124+ [script_file], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
125+ stderr=subprocess.PIPE, bufsize=0)
126+
127+ self.addCleanup(self.rewriter_proc.terminate)
128+
129+ def request(self, query):
130+ self.rewriter_proc.stdin.write(query + '\n')
131+ self.rewriter_proc.stdin.flush()
132+
133+ # 60 second timeout as we might need to wait for the script to
134+ # finish starting up.
135+ result = nonblocking_readline(self.rewriter_proc.stdout, 60)
136+
137+ if result.endswith('\n'):
138+ return result[:-1]
139+ self.fail(
140+ "Incomplete line or no result retrieved from subprocess: %s"
141+ % repr(result.getvalue()))
142+
143+ def test_reconnects_when_disconnected(self):
144+ pgbouncer = self.useFixture(PGBouncerFixture())
145+
146+ self.spawn()
147+
148+ # Everything should be working, and we get valid output.
149+ out = self.request('foo')
150+ self.assertEndsWith(out, '/foo')
151+
152+ pgbouncer.stop()
153+
154+ # Now with pgbouncer down, we should get NULL messages and
155+ # stderr spam, and this keeps happening. We test more than
156+ # once to ensure that we will keep trying to reconnect even
157+ # after several failures.
158+ for count in range(5):
159+ out = self.request('foo')
160+ self.assertEqual(out, 'NULL')
161+
162+ pgbouncer.start()
163+
164+ # Everything should be working, and we get valid output.
165+ out = self.request('foo')
166+ self.assertEndsWith(out, '/foo')
167+
168+ def test_starts_with_db_down(self):
169+ pgbouncer = self.useFixture(PGBouncerFixture())
170+
171+ # Start with the database down.
172+ pgbouncer.stop()
173+
174+ self.spawn()
175+
176+ for count in range(5):
177+ out = self.request('foo')
178+ self.assertEqual(out, 'NULL')
179+
180+ pgbouncer.start()
181+
182+ out = self.request('foo')
183+ self.assertEndsWith(out, '/foo')
184
185=== modified file 'lib/lp/testing/__init__.py'
186--- lib/lp/testing/__init__.py 2011-08-19 18:20:58 +0000
187+++ lib/lp/testing/__init__.py 2011-09-03 05:39:35 +0000
188@@ -29,6 +29,7 @@
189 'logout',
190 'map_branch_contents',
191 'normalize_whitespace',
192+ 'nonblocking_readline',
193 'oauth_access_token_for',
194 'person_logged_in',
195 'quote_jquery_expression',
196@@ -69,6 +70,7 @@
197 import os
198 from pprint import pformat
199 import re
200+from select import select
201 import shutil
202 import subprocess
203 import sys
204@@ -1325,3 +1327,25 @@
205 def extract_lp_cache(text):
206 match = re.search(r'<script>LP.cache = (\{.*\});</script>', text)
207 return simplejson.loads(match.group(1))
208+
209+
210+def nonblocking_readline(instream, timeout):
211+ """Non-blocking readline.
212+
213+ Files must provide a valid fileno() method. This is a test helper
214+ as it is inefficient and unlikely useful for production code.
215+ """
216+ result = StringIO()
217+ start = now = time.time()
218+ deadline = start + timeout
219+ while (now < deadline and not result.getvalue().endswith('\n')):
220+ rlist = select([instream], [], [], deadline - now)
221+ if rlist:
222+ # Reading 1 character at a time is inefficient, but means
223+ # we don't need to implement put-back.
224+ next_char = os.read(instream.fileno(), 1)
225+ if next_char == "":
226+ break # EOF
227+ result.write(next_char)
228+ now = time.time()
229+ return result.getvalue()
230
231=== modified file 'lib/lp/testing/tests/test_pgsql.py'
232--- lib/lp/testing/tests/test_pgsql.py 2011-02-19 13:50:19 +0000
233+++ lib/lp/testing/tests/test_pgsql.py 2011-09-03 05:39:35 +0000
234@@ -53,7 +53,7 @@
235 fixture.setUp()
236 self.addCleanup(fixture.dropDb)
237 self.addCleanup(fixture.tearDown)
238- expected_value = 'dbname=%s' % fixture.dbname
239+ expected_value = 'dbname=%s host=localhost' % fixture.dbname
240 self.assertEqual(expected_value, dbconfig.rw_main_master)
241 self.assertEqual(expected_value, dbconfig.rw_main_slave)
242 with ConfigUseFixture(BaseLayer.appserver_config_name):
243
244=== modified file 'scripts/branch-rewrite.py'
245--- scripts/branch-rewrite.py 2011-09-02 10:50:34 +0000
246+++ scripts/branch-rewrite.py 2011-09-03 05:39:35 +0000
247@@ -1,6 +1,6 @@
248 #!/usr/bin/python -uS
249 #
250-# Copyright 2009 Canonical Ltd. This software is licensed under the
251+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
252 # GNU Affero General Public License version 3 (see the file LICENSE).
253
254 # pylint: disable-msg=W0403
255@@ -19,6 +19,8 @@
256
257 from canonical.database.sqlbase import ISOLATION_LEVEL_AUTOCOMMIT
258 from canonical.config import config
259+from canonical.launchpad.interfaces.lpstorm import ISlaveStore
260+from lp.code.model.branch import Branch
261 from lp.codehosting.rewrite import BranchRewriter
262 from lp.services.log.loglevels import INFO, WARNING
263 from lp.services.scripts.base import LaunchpadScript
264@@ -60,9 +62,19 @@
265 return
266 except KeyboardInterrupt:
267 sys.exit()
268- except:
269+ except Exception:
270 self.logger.exception('Exception occurred:')
271 print "NULL"
272+ # The exception might have been a DisconnectionError or
273+ # similar. Cleanup such as database reconnection will
274+ # not happen until the transaction is rolled back.
275+ # XXX StuartBishop 2011-08-31 bug=819282: We are
276+ # explicitly rolling back the store here as a workaround
277+ # instead of using transaction.abort()
278+ try:
279+ ISlaveStore(Branch).rollback()
280+ except Exception:
281+ self.logger.exception('Exception occurred in rollback:')
282
283
284 if __name__ == '__main__':