Merge lp:~stub/launchpad/replication into lp:launchpad/db-devel

Proposed by Stuart Bishop on 2010-01-15
Status: Merged
Approved by: Stuart Bishop on 2010-01-18
Approved revision: not available
Merged at revision: not available
Proposed branch: lp:~stub/launchpad/replication
Merge into: lp:launchpad/db-devel
Diff against target: 60 lines (+27/-2)
2 files modified
lib/canonical/config/schema-lazr.conf (+5/-0)
lib/canonical/launchpad/webapp/dbpolicy.py (+22/-2)
To merge this branch: bzr merge lp:~stub/launchpad/replication
Reviewer: Henning Eggers (community)
Review type: code
Date requested: 2010-01-15
Status: Approve on 2010-01-18
Review via email: mp+17454@code.launchpad.net

Commit Message

Don't let replication lag checks by the appserver block for long: time out after 250 ms and assume the slave is lagged.

Stuart Bishop (stub) wrote:

Addresses Bug #504696. This is a cherry-pick candidate, once manual tests on staging and edge demonstrate that things are working as intended.

When things get busy, it can be slow to query the Slony-I tables to determine how lagged the slave database is. This is something we need to deal with, as we are sort of abusing this facility; I doubt the sl_status view was ever intended to be queried 20-30 times per second.

Rather than just letting the replication lag checks go slow, making users cry and timeouts rise, we put a timeout on the check itself: if we can't get the information we need within 250 ms, we assume lag is bad and proceed.
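
In sketch form, the fallback policy is just this (illustrative code, not the branch's; the names are made up, and the 250 ms figure mirrors the lag_check_timeout default in the diff below):

    from datetime import timedelta

    LAG_CHECK_TIMEOUT_MS = 250  # mirrors the lag_check_timeout config default

    def lag_or_worst_case(query_lag):
        """Return the slave's replication lag, or a worst-case guess.

        query_lag is any callable that raises TimeoutError when the
        database cancels the query after LAG_CHECK_TIMEOUT_MS.
        """
        try:
            return query_lag()
        except TimeoutError:
            # We could not measure the lag in time; report an absurdly
            # large value so callers treat the slave as unusable.
            return timedelta(days=999)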

This change cannot be tested by our test suite. It has been tested locally against a replicated environment and should be tested on staging next.

Henning Eggers (henninge) wrote:

Hi stub,
this sounds like a smart thing to do and the branch looks good. The only thing that confuses me is the ".get_one()[0]" construct, which looks like strange semantics for the get_one method; I'd expect it *not* to return a list. But you did not introduce that, so I'll just let it go.

Cheers,
Henning

review: Approve (code)
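
For readers puzzled by the same construct: Storm's Result.get_one() returns the first row as a tuple (or None), not a bare value, so the [0] picks out the single selected column. A minimal sketch (the DSN is hypothetical; any PostgreSQL database would do):

    from storm.database import create_database

    # Hypothetical connection details, for illustration only.
    database = create_database('postgres://user@localhost/launchpad_dev')
    connection = database.connect()

    # get_one() fetches one row as a tuple, hence the trailing [0] in
    # the diff below to unpack the single replication_lag(...) column.
    row = connection.execute('SELECT 42').get_one()
    assert row == (42,)
    assert row[0] == 42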

Preview Diff

=== modified file 'lib/canonical/config/schema-lazr.conf'
--- lib/canonical/config/schema-lazr.conf 2009-12-27 00:54:45 +0000
+++ lib/canonical/config/schema-lazr.conf 2010-01-15 12:01:26 +0000
@@ -873,6 +873,11 @@
 storm_cache: generational
 storm_cache_size: 10000
 
+# Assume the slave database is lagged if it takes more than this many
+# milliseconds to calculate this information from the Slony-I tables.
+# datatype: integer
+lag_check_timeout: 250
+
 # If False, do not launch the appserver.
 # datatype: boolean
 launch: True
 
=== modified file 'lib/canonical/launchpad/webapp/dbpolicy.py'
--- lib/canonical/launchpad/webapp/dbpolicy.py 2009-12-02 19:33:08 +0000
+++ lib/canonical/launchpad/webapp/dbpolicy.py 2010-01-15 12:01:26 +0000
@@ -12,9 +12,11 @@
     ]
 
 from datetime import datetime, timedelta
+import logging
 from textwrap import dedent
 
 from storm.cache import Cache, GenerationalCache
+from storm.exceptions import TimeoutError
 from storm.zope.interfaces import IZStorm
 from zope.session.interfaces import ISession, IClientIdManager
 from zope.component import getUtility
@@ -291,8 +293,26 @@
 
         # sl_status gives meaningful results only on the origin node.
         master_store = self.getStore(MAIN_STORE, MASTER_FLAVOR)
-        return master_store.execute(
-            "SELECT replication_lag(%d)" % slave_node_id).get_one()[0]
+        # If it takes more than (by default) 0.25 seconds to query the
+        # replication lag, assume we are lagged. Normally the query
+        # takes <20ms. This can happen during heavy updates, as the
+        # Slony-I tables can get slow with lots of events. We use a
+        # SAVEPOINT to conveniently reset the statement timeout.
+        master_store.execute("""
+            SAVEPOINT lag_check; SET LOCAL statement_timeout TO %d
+            """ % config.launchpad.lag_check_timeout)
+        try:
+            try:
+                return master_store.execute(
+                    "SELECT replication_lag(%d)" % slave_node_id).get_one()[0]
+            except TimeoutError:
+                logging.warn(
+                    'Gave up querying slave lag after %d ms',
+                    config.launchpad.lag_check_timeout)
+                return timedelta(days=999)  # A long, long time.
+        finally:
+            master_store.execute("ROLLBACK TO lag_check")
+
 
 
 def WebServiceDatabasePolicyFactory(request):
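
A note on the SAVEPOINT trick in the hunk above: SET LOCAL changes last until the end of the transaction, but rolling back to an earlier savepoint also cancels them, so the 250 ms statement_timeout never leaks into later queries on the same connection. A standalone sketch of the same pattern using psycopg2 (the DSN is hypothetical):

    import psycopg2
    from psycopg2.extensions import QueryCanceled

    conn = psycopg2.connect('dbname=launchpad_dev')  # hypothetical DSN
    cur = conn.cursor()

    # SET LOCAL after a SAVEPOINT: rolling back to the savepoint undoes
    # the timeout along with any partial work, restoring the old setting.
    cur.execute("SAVEPOINT lag_check; SET LOCAL statement_timeout TO 250")
    try:
        cur.execute("SELECT pg_sleep(1)")  # outlives 250 ms, so cancelled
    except QueryCanceled:
        print('Gave up querying slave lag; assuming the slave is lagged.')
    finally:
        cur.execute("ROLLBACK TO lag_check")  # statement_timeout restored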
