Merge lp:~stub/launchpad/retry-404 into lp:launchpad

Proposed by Stuart Bishop
Status: Merged
Approved by: Данило Шеган
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~stub/launchpad/retry-404
Merge into: lp:launchpad
To merge this branch: bzr merge lp:~stub/launchpad/retry-404
Reviewer Review Type Date Requested Status
Данило Шеган (community) Approve
Review via email: mp+2805@code.launchpad.net

This proposal supersedes a proposal from 2009-01-09.

To post a comment you must log in.
Revision history for this message
Stuart Bishop (stub) wrote : Posted in a previous version of this proposal

Addresses Bug #311690.

During normal operation, the slave databases lag behind the master
by 30-60 seconds. If a browser attempts to access a newly created,
and the database policy selects a slave database as the data source,
a 404 error page will be returned to the user. This is a surprising
result, given the same user might have just created the missing
information. This is most noticible when multiple tools are
being used to access Launchpad, such as Apport and a web browser
as the database selection policy cannot tell that the same person
is the operator.

To solve this, we cause the publisher to retry requests using the
master database as the default datasource instead of returning
the 404 error page if a slave was the default datasource. This is
similar to our existing deadlock and serialization exception handling,
and uses the same mechanism (the conflict handling for Zope's optimistic
transactions).

Revision history for this message
Francis J. Lacoste (flacoste) wrote : Posted in a previous version of this proposal
Download full text (3.6 KiB)

> === modified file 'lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt'
> +
> +
> +A 404 error page is shown when code raises a LookupError. If a slave
> +database is being used, this might have been caused by replication lag
> +if the missing data was only recently created. To fix this surprising
> +error, requests are always retried using the master database before
> +returning a 404 error to the user.
> +
> + >>> anon_browser.handleErrors = True
> + >>> anon_browser.raiseHttpErrors = False
> +
> + # Confirm requests are going to the SLAVE
> + >>> anon_browser.open('http://launchpad.dev/+whichdb')
> + >>> print whichdb(anon_browser)
> + SLAVE
> +
> + # The slave database contains no data, but we don't get
> + # a 404 page - the request is retried against the MASTER.
> + >>> anon_browser.open('http://launchpad.dev/~stub')
> + >>> anon_browser.headers['Status']
> + '200 Ok'

I'm not sure I understand why would fail? In our tests, the SLAVE and the
MASTER are the same database, so how would a LookupError be raised in one
case, and not in the other?

> +
> + # 404s are still returned though if the data doesn't exist in the
> + # MASTER database either.
> + >>> anon_browser.open('http://launchpad.dev/~does-not-exist')
> + >>> anon_browser.headers['Status']
> + '404 Not Found'
> +
> + # This session is still using the SLAVE though by default.
> + >>> anon_browser.open('http://launchpad.dev/+whichdb')
> + >>> print whichdb(anon_browser)
> + SLAVE

> === modified file 'lib/canonical/launchpad/webapp/publication.py'
> --- lib/canonical/launchpad/webapp/publication.py 2008-11-04 16:25:36 +0000
> +++ lib/canonical/launchpad/webapp/publication.py 2009-01-09 09:12:16 +0000
> @@ -43,7 +43,8 @@
> import canonical.launchpad.webapp.adapter as da
> from canonical.launchpad.webapp.interfaces import (
> IDatabasePolicy, IPlacelessAuthUtility, IPrimaryContext,
> - ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError)
> + ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError,
> + IStoreSelector, MAIN_STORE, DEFAULT_FLAVOR, MASTER_FLAVOR)
> from canonical.launchpad.webapp.opstats import OpStats
> from canonical.launchpad.webapp.uri import URI, InvalidURIError
> from canonical.launchpad.webapp.vhosts import allvhosts
> @@ -431,6 +432,18 @@
> # the publication, so there's nothing we need to do here.
> pass
>
> + # If we get a LookupError and the default database being used is
> + # a replica, raise a Retry exception instead of returning
> + # the 404 error page. We do this in case the LookupError is
> + # caused by replication lag. Our database policy forces the
> + # use of the master database for retries.
> + if retry_allowed and isinstance(exc_info[1], LookupError):
> + store_selector = getUtility(IStoreSelector)
> + default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
> + master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
> + if default_store is not master_store:
> + raise Retry(exc_info)
> +
> # Reraise Re...

Read more...

review: Needs Fixing
Revision history for this message
Stuart Bishop (stub) wrote : Posted in a previous version of this proposal
Download full text (4.0 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, Jan 9, 2009 at 10:12 PM, Francis J. Lacoste wrote:
> Review: Needs Fixing
>> === modified file
'lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt'
>> +
>> +
>> +A 404 error page is shown when code raises a LookupError. If a slave
>> +database is being used, this might have been caused by replication lag
>> +if the missing data was only recently created. To fix this surprising
>> +error, requests are always retried using the master database before
>> +returning a 404 error to the user.
>> +
>> + >>> anon_browser.handleErrors = True
>> + >>> anon_browser.raiseHttpErrors = False
>> +
>> + # Confirm requests are going to the SLAVE
>> + >>> anon_browser.open('http://launchpad.dev/+whichdb')
>> + >>> print whichdb(anon_browser)
>> + SLAVE
>> +
>> + # The slave database contains no data, but we don't get
>> + # a 404 page - the request is retried against the MASTER.
>> + >>> anon_browser.open('http://launchpad.dev/~stub')
>> + >>> anon_browser.headers['Status']
>> + '200 Ok'
>
> I'm not sure I understand why would fail? In our tests, the SLAVE and the
> MASTER are the same database, so how would a LookupError be raised in one
> case, and not in the other?

At the top of this page test, we change this so the SLAVE database is
an empty database. I've also added some code up there that confirms
the database actually is empty (the other tests didn't care - they
just used the database name as reported by PostgreSQL).

>> === modified file 'lib/canonical/launchpad/webapp/publication.py'
>> --- lib/canonical/launchpad/webapp/publication.py 2008-11-04
16:25:36 +0000
>> +++ lib/canonical/launchpad/webapp/publication.py 2009-01-09
09:12:16 +0000
>> @@ -43,7 +43,8 @@
>> import canonical.launchpad.webapp.adapter as da
>> from canonical.launchpad.webapp.interfaces import (
>> IDatabasePolicy, IPlacelessAuthUtility, IPrimaryContext,
>> - ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError)
>> + ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError,
>> + IStoreSelector, MAIN_STORE, DEFAULT_FLAVOR, MASTER_FLAVOR)
>> from canonical.launchpad.webapp.opstats import OpStats
>> from canonical.launchpad.webapp.uri import URI, InvalidURIError
>> from canonical.launchpad.webapp.vhosts import allvhosts
>> @@ -431,6 +432,18 @@
>> # the publication, so there's nothing we need to do here.
>> pass
>>
>> + # If we get a LookupError and the default database being used is
>> + # a replica, raise a Retry exception instead of returning
>> + # the 404 error page. We do this in case the LookupError is
>> + # caused by replication lag. Our database policy forces the
>> + # use of the master database for retries.
>> + if retry_allowed and isinstance(exc_info[1], LookupError):
>> + store_selector = getUtility(IStoreSelector)
>> + default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
>> + master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
>> + if default_store is not master_store:
>> + raise Retry(exc_info)
>> +
>...

Read more...

1=== modified file 'lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt'
2--- lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt 2009-01-09 09:12:16 +0000
3+++ lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt 2009-01-12 11:29:50 +0000
4@@ -17,6 +17,24 @@
5 ... """)
6 >>> config.push('empty_slave', config_overlay)
7
8+We should confirm that the empty database is as empty as we hope it is.
9+
10+ >>> from zope.component import getUtility
11+ >>> from canonical.launchpad.webapp.interfaces import (
12+ ... IStoreSelector, MAIN_STORE, MASTER_FLAVOR, SLAVE_FLAVOR)
13+ >>> from canonical.launchpad.database.person import Person
14+ >>> slave_store = getUtility(IStoreSelector).get(
15+ ... MAIN_STORE, SLAVE_FLAVOR)
16+ >>> master_store = getUtility(IStoreSelector).get(
17+ ... MAIN_STORE, MASTER_FLAVOR)
18+ >>> slave_store.find(Person).count()
19+ 0
20+ >>> master_store.find(Person).count() > 0
21+ True
22+
23+This helper parses the output of the +whichdb view (which unfortunately
24+needs to be created externally to this pagetest).
25+
26 >>> def whichdb(browser):
27 ... dbname = extract_text(find_tag_by_id(browser.contents, 'dbname'))
28 ... if dbname == 'launchpad_ftest':
29
30=== modified file 'lib/canonical/launchpad/webapp/publication.py'
31--- lib/canonical/launchpad/webapp/publication.py 2009-01-09 09:12:16 +0000
32+++ lib/canonical/launchpad/webapp/publication.py 2009-01-12 11:53:42 +0000
33@@ -431,23 +431,48 @@
34 # The exception wasn't raised in the middle of the traversal nor
35 # the publication, so there's nothing we need to do here.
36 pass
37-
38- # If we get a LookupError and the default database being used is
39- # a replica, raise a Retry exception instead of returning
40- # the 404 error page. We do this in case the LookupError is
41- # caused by replication lag. Our database policy forces the
42- # use of the master database for retries.
43- if retry_allowed and isinstance(exc_info[1], LookupError):
44- store_selector = getUtility(IStoreSelector)
45- default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
46- master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
47- if default_store is not master_store:
48- raise Retry(exc_info)
49-
50- # Reraise Retry exceptions rather than log.
51- if retry_allowed and isinstance(
52- exc_info[1], (Retry, DisconnectionError, IntegrityError,
53- TransactionRollbackError)):
54+
55+ def should_retry(exc_info):
56+ if not retry_allowed:
57+ return False
58+
59+ # If we get a LookupError and the default database being
60+ # used is a replica, raise a Retry exception instead of
61+ # returning the 404 error page. We do this in case the
62+ # LookupError is caused by replication lag. Our database
63+ # policy forces the use of the master database for retries.
64+ if isinstance(exc_info[1], LookupError):
65+ store_selector = getUtility(IStoreSelector)
66+ default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
67+ master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
68+ if default_store is master_store:
69+ return False
70+ else:
71+ return True
72+
73+ # Retry exceptions need to be propagated so they are
74+ # retried. Retry exceptions occur when an optimistic
75+ # transaction failed, such as we detected two transactions
76+ # attempting to modify the same resource.
77+ # DisconnectionError and TransactionRollbackError indicate
78+ # a database transaction failure, and should be retried
79+ # The appserver detects the error state, and a new database
80+ # connection is opened allowing the appserver to cope with
81+ # database or network outages.
82+ # An IntegrityError may be caused when we insert a row
83+ # into the database that already exists, such as two requests
84+ # doing an insert-or-update. It may succeed if we try again.
85+ if isinstance(exc_info[1], (Retry, DisconnectionError,
86+ IntegrityError, TransactionRollbackError)):
87+ return True
88+
89+ return False
90+
91+ # Reraise Retry exceptions ourselves rather than invoke
92+ # our superclass handleException method, as it will log OOPS
93+ # reports etc. This would be incorrect, as transaction retry
94+ # is a normal part of operation.
95+ if should_retry(exc_info):
96 if request.supportsRetry():
97 # Remove variables used for counting ticks as this request is
98 # going to be retried.
99@@ -456,9 +481,11 @@
100 if isinstance(exc_info[1], Retry):
101 raise
102 raise Retry(exc_info)
103+
104 superclass = zope.app.publication.browser.BrowserPublication
105- superclass.handleException(self, object, request, exc_info,
106- retry_allowed)
107+ superclass.handleException(
108+ self, object, request, exc_info, retry_allowed)
109+
110 # If it's a HEAD request, we don't care about the body, regardless of
111 # exception.
112 # UPSTREAM: Should this be part of zope,
Revision history for this message
Francis J. Lacoste (flacoste) wrote : Posted in a previous version of this proposal

Looks good.

  status approved

--
Francis J. Lacoste
<email address hidden>

review: Approve
Revision history for this message
Stuart Bishop (stub) wrote :

Diff of unreviewed changes is at https://pastebin.canonical.com/12581/

Migrating the 'should this exception cause a Retry' logic to the IDatabasePolicy doesn't seem to fit very well, so I have left it the way it is.

Revision history for this message
Данило Шеган (danilo) wrote :

All looks good.

review: Approve