Merge into devel : retry-404 : Code : Launchpad itself

Status:

Merged

Approved by:

Данило Шеган on 2009-01-13

Approved revision:

no longer in the source branch.

Merged at revision:

not available

Proposed branch:

lp:~stub/launchpad/retry-404

Merge into:

lp:launchpad

To merge this branch:

bzr merge lp:~stub/launchpad/retry-404

Reviewer	Review Type	Date Requested	Status
Данило Шеган (community)		2009-01-13	Approve on 2009-01-13
Review via email: mp+2805@code.launchpad.net

This proposal supersedes a proposal from 2009-01-09.

Revision history for this message

Stuart Bishop (stub) wrote on 2009-01-09: Posted in a previous version of this proposal

Addresses Bug #311690.

During normal operation, the slave databases lag behind the master
by 30-60 seconds. If a browser attempts to access a newly created
object and the database policy selects a slave database as the data
source, a 404 error page will be returned to the user. This is a
surprising result, given that the same user might have just created
the missing information. This is most noticeable when multiple tools
are being used to access Launchpad, such as Apport and a web browser,
as the database selection policy cannot tell that the same person
is the operator.

To solve this, the publisher now retries the request with the
master database as the default data source, instead of returning
the 404 error page, whenever a slave was the default data source.
This is similar to our existing deadlock and serialization exception
handling, and uses the same mechanism (the conflict handling for
Zope's optimistic transactions).
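
The retry behaviour described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Launchpad code: the Retry class, the MASTER and SLAVE dictionaries, and the handle_request() and publish() functions are all hypothetical stand-ins for the publisher, the stores, and the database policy.

```python
class Retry(Exception):
    """Stand-in for the publisher's retry signal (not the real Zope class)."""


# Hypothetical data: the master has the new record, the lagging slave does not.
MASTER = {"~stub": "Stuart Bishop"}
SLAVE = {}


def handle_request(name, store, store_is_master):
    """Serve one request from the given store."""
    try:
        return 200, store[name]  # a missing key raises KeyError, a LookupError
    except LookupError:
        if not store_is_master:
            # The record may simply not have replicated yet: ask for a retry.
            raise Retry(name)
        # The master is authoritative, so this is a genuine 404.
        return 404, None


def publish(name):
    """First attempt uses the slave; a Retry forces the master."""
    try:
        return handle_request(name, SLAVE, store_is_master=False)
    except Retry:
        return handle_request(name, MASTER, store_is_master=True)
```

With this sketch, publish("~stub") is answered from the master after one retry, while a name missing from both stores still ends in a 404.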

Francis J. Lacoste (flacoste) wrote on 2009-01-09: Posted in a previous version of this proposal

> === modified file 'lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt'
> +
> +
> +A 404 error page is shown when code raises a LookupError. If a slave
> +database is being used, this might have been caused by replication lag
> +if the missing data was only recently created. To fix this surprising
> +error, requests are always retried using the master database before
> +returning a 404 error to the user.
> +
> +    >>> anon_browser.handleErrors = True
> +    >>> anon_browser.raiseHttpErrors = False
> +
> +    # Confirm requests are going to the SLAVE
> +    >>> anon_browser.open('http://launchpad.dev/+whichdb')
> +    >>> print whichdb(anon_browser)
> +    SLAVE
> +
> +    # The slave database contains no data, but we don't get
> +    # a 404 page - the request is retried against the MASTER.
> +    >>> anon_browser.open('http://launchpad.dev/~stub')
> +    >>> anon_browser.headers['Status']
> +    '200 Ok'

I'm not sure I understand why this would fail. In our tests, the SLAVE and
the MASTER are the same database, so how would a LookupError be raised in
one case and not in the other?

> +
> +    # 404s are still returned though if the data doesn't exist in the
> +    # MASTER database either.
> +    >>> anon_browser.open('http://launchpad.dev/~does-not-exist')
> +    >>> anon_browser.headers['Status']
> +    '404 Not Found'
> +
> +    # This session is still using the SLAVE though by default.
> +    >>> anon_browser.open('http://launchpad.dev/+whichdb')
> +    >>> print whichdb(anon_browser)
> +    SLAVE

> === modified file 'lib/canonical/launchpad/webapp/publication.py'
> --- lib/canonical/launchpad/webapp/publication.py       2008-11-04 16:25:36 +0000
> +++ lib/canonical/launchpad/webapp/publication.py       2009-01-09 09:12:16 +0000
> @@ -43,7 +43,8 @@
>  import canonical.launchpad.webapp.adapter as da
>  from canonical.launchpad.webapp.interfaces import (
>      IDatabasePolicy, IPlacelessAuthUtility, IPrimaryContext,
> -    ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError)
> +    ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError,
> +    IStoreSelector, MAIN_STORE, DEFAULT_FLAVOR, MASTER_FLAVOR)
>  from canonical.launchpad.webapp.opstats import OpStats
>  from canonical.launchpad.webapp.uri import URI, InvalidURIError
>  from canonical.launchpad.webapp.vhosts import allvhosts
> @@ -431,6 +432,18 @@
>              # the publication, so there's nothing we need to do here.
>              pass
>  
> +        # If we get a LookupError and the default database being used is
> +        # a replica, raise a Retry exception instead of returning
> +        # the 404 error page. We do this in case the LookupError is
> +        # caused by replication lag. Our database policy forces the
> +        # use of the master database for retries.
> +        if retry_allowed and isinstance(exc_info[1], LookupError):
> +            store_selector = getUtility(IStoreSelector)
> +            default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
> +            master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
> +            if default_store is not master_store:
> +                raise Retry(exc_info)
> +
>          # Reraise Retry exceptions rather than log.
>          if retry_allowed and isinstance(
>              exc_info[1], (Retry, DisconnectionError, IntegrityError,

If you look at the block at the context boundary, you'll see that it removes
the variables related to tracking the number of ticks in the request.

So you should either do the same thing, or move your check into a
shouldRetryLookup() method and add it to the list of conditions on which
that block is triggered.
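
The second option might look roughly like the sketch below. It is only an illustration: the Retry class is a stand-in, the TICK_KEYS names are hypothetical (the real tick-counting variables are not shown in this diff), and handle_exception() is a simplified model of handleException, not its actual signature.

```python
class Retry(Exception):
    """Stand-in for the publisher's retry exception (not the real Zope class)."""


# Hypothetical names for the per-request tick-counting variables.
TICK_KEYS = ("ticks.start", "ticks.count")


def should_retry_lookup(exc, default_is_master):
    """Retry a failed lookup only if a replica served the request."""
    return isinstance(exc, LookupError) and not default_is_master


def handle_exception(exc, request_vars, default_is_master, retry_allowed=True):
    """Route every retryable exception through one cleanup block."""
    retryable = (
        isinstance(exc, Retry) or should_retry_lookup(exc, default_is_master))
    if retry_allowed and retryable:
        # Clear the tick counters first, so the retried request starts clean.
        for key in TICK_KEYS:
            request_vars.pop(key, None)
        raise exc if isinstance(exc, Retry) else Retry(exc)
    return "error page"
```

Because the lookup check is just one more condition feeding the existing block, the tick-count cleanup cannot be skipped by an early raise.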

review: Needs Fixing

Stuart Bishop (stub) wrote on 2009-01-12: Posted in a previous version of this proposal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, Jan 9, 2009 at 10:12 PM, Francis J. Lacoste  wrote:
> Review: Needs Fixing
>> === modified file 'lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt'
>> +
>> +
>> +A 404 error page is shown when code raises a LookupError. If a slave
>> +database is being used, this might have been caused by replication lag
>> +if the missing data was only recently created. To fix this surprising
>> +error, requests are always retried using the master database before
>> +returning a 404 error to the user.
>> +
>> +    >>> anon_browser.handleErrors = True
>> +    >>> anon_browser.raiseHttpErrors = False
>> +
>> +    # Confirm requests are going to the SLAVE
>> +    >>> anon_browser.open('http://launchpad.dev/+whichdb')
>> +    >>> print whichdb(anon_browser)
>> +    SLAVE
>> +
>> +    # The slave database contains no data, but we don't get
>> +    # a 404 page - the request is retried against the MASTER.
>> +    >>> anon_browser.open('http://launchpad.dev/~stub')
>> +    >>> anon_browser.headers['Status']
>> +    '200 Ok'
>
> I'm not sure I understand why this would fail. In our tests, the SLAVE and
> the MASTER are the same database, so how would a LookupError be raised in
> one case and not in the other?

At the top of this page test, we change this so the SLAVE database is
an empty database. I've also added some code up there that confirms
the database actually is empty (the other tests didn't care - they
just used the database name as reported by PostgreSQL).

>> === modified file 'lib/canonical/launchpad/webapp/publication.py'
>> --- lib/canonical/launchpad/webapp/publication.py       2008-11-04 16:25:36 +0000
>> +++ lib/canonical/launchpad/webapp/publication.py       2009-01-09 09:12:16 +0000
>> @@ -43,7 +43,8 @@
>>  import canonical.launchpad.webapp.adapter as da
>>  from canonical.launchpad.webapp.interfaces import (
>>      IDatabasePolicy, IPlacelessAuthUtility, IPrimaryContext,
>> -    ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError)
>> +    ILaunchpadRoot, IOpenLaunchBag, OffsiteFormPostError,
>> +    IStoreSelector, MAIN_STORE, DEFAULT_FLAVOR, MASTER_FLAVOR)
>>  from canonical.launchpad.webapp.opstats import OpStats
>>  from canonical.launchpad.webapp.uri import URI, InvalidURIError
>>  from canonical.launchpad.webapp.vhosts import allvhosts
>> @@ -431,6 +432,18 @@
>>              # the publication, so there's nothing we need to do here.
>>              pass
>>
>> +        # If we get a LookupError and the default database being used is
>> +        # a replica, raise a Retry exception instead of returning
>> +        # the 404 error page. We do this in case the LookupError is
>> +        # caused by replication lag. Our database policy forces the
>> +        # use of the master database for retries.
>> +        if retry_allowed and isinstance(exc_info[1], LookupError):
>> +            store_selector = getUtility(IStoreSelector)
>> +            default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
>> +            master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
>> +            if default_store is not master_store:
>> +                raise Retry(exc_info)
>> +
>>          # Reraise Retry exceptions rather than log.
>>          if retry_allowed and isinstance(
>>              exc_info[1], (Retry, DisconnectionError, IntegrityError,
>
> If you look at the block at the context boundary, you'll see that it removes
> the variables related to tracking the number of ticks in the request.
>
> So you should either do the same thing, or move your check into a
> shouldRetryLookup() method and add it to the list of conditions on which
> that block is triggered.

I've refactored the handleException method so the tickcounts are
handled correctly. I created a should_retry() helper in the
handleException method to encapsulate the logic.

- --
Stuart Bishop
http://www.stuartbishop.net/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: http://getfiregpg.org

iD8DBQFJay/jAfqZj7rGN0oRAmFrAJ9jadxhwFI4FkOH+LixckOO6wcG7QCcCDL5
6Aj4SeUZACZzgZF15sVn/CY=
=WIhw
-----END PGP SIGNATURE-----

404.diff

Francis J. Lacoste (flacoste) wrote on 2009-01-12: Posted in a previous version of this proposal

Looks good.

status approved

--
Francis J. Lacoste
<email address hidden>

review: Approve

Stuart Bishop (stub) wrote on 2009-01-13:

Diff of unreviewed changes is at https://pastebin.canonical.com/12581/

Migrating the 'should this exception cause a Retry' logic to the IDatabasePolicy doesn't seem to fit very well, so I have left it the way it is.

Данило Шеган (danilo) wrote on 2009-01-13:

All looks good.

review: Approve

=== modified file 'lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt'
--- lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt	2009-01-09 09:12:16 +0000
+++ lib/canonical/launchpad/pagetests/standalone/xx-dbpolicy.txt	2009-01-12 11:29:50 +0000
@@ -17,6 +17,24 @@
     ...     """)
     >>> config.push('empty_slave', config_overlay)
+We should confirm that the empty database is as empty as we hope it is.
+
+    >>> from zope.component import getUtility
+    >>> from canonical.launchpad.webapp.interfaces import (
+    ...     IStoreSelector, MAIN_STORE, MASTER_FLAVOR, SLAVE_FLAVOR)
+    >>> from canonical.launchpad.database.person import Person
+    >>> slave_store = getUtility(IStoreSelector).get(
+    ...     MAIN_STORE, SLAVE_FLAVOR)
+    >>> master_store = getUtility(IStoreSelector).get(
+    ...     MAIN_STORE, MASTER_FLAVOR)
+    >>> slave_store.find(Person).count()
+    0
+    >>> master_store.find(Person).count() > 0
+    True
+
+This helper parses the output of the +whichdb view (which unfortunately
+needs to be created externally to this pagetest).
+
     >>> def whichdb(browser):
     ...     dbname = extract_text(find_tag_by_id(browser.contents, 'dbname'))
     ...     if dbname == 'launchpad_ftest':
=== modified file 'lib/canonical/launchpad/webapp/publication.py'
--- lib/canonical/launchpad/webapp/publication.py	2009-01-09 09:12:16 +0000
+++ lib/canonical/launchpad/webapp/publication.py	2009-01-12 11:53:42 +0000
@@ -431,23 +431,48 @@
             # The exception wasn't raised in the middle of the traversal nor
             # the publication, so there's nothing we need to do here.
             pass
-
-        # If we get a LookupError and the default database being used is
-        # a replica, raise a Retry exception instead of returning
-        # the 404 error page. We do this in case the LookupError is
-        # caused by replication lag. Our database policy forces the
-        # use of the master database for retries.
-        if retry_allowed and isinstance(exc_info[1], LookupError):
-            store_selector = getUtility(IStoreSelector)
-            default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
-            master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
-            if default_store is not master_store:
-                raise Retry(exc_info)
-
-        # Reraise Retry exceptions rather than log.
-        if retry_allowed and isinstance(
-            exc_info[1], (Retry, DisconnectionError, IntegrityError,
-                          TransactionRollbackError)):
+
+        def should_retry(exc_info):
+            if not retry_allowed:
+                return False
+
+            # If we get a LookupError and the default database being
+            # used is a replica, raise a Retry exception instead of
+            # returning the 404 error page. We do this in case the
+            # LookupError is caused by replication lag. Our database
+            # policy forces the use of the master database for retries.
+            if isinstance(exc_info[1], LookupError):
+                store_selector = getUtility(IStoreSelector)
+                default_store = store_selector.get(MAIN_STORE, DEFAULT_FLAVOR)
+                master_store = store_selector.get(MAIN_STORE, MASTER_FLAVOR)
+                if default_store is master_store:
+                    return False
+                else:
+                    return True
+
+            # Retry exceptions need to be propagated so they are
+            # retried. Retry exceptions occur when an optimistic
+            # transaction failed, such as we detected two transactions
+            # attempting to modify the same resource.
+            # DisconnectionError and TransactionRollbackError indicate
+            # a database transaction failure, and should be retried
+            # The appserver detects the error state, and a new database
+            # connection is opened allowing the appserver to cope with
+            # database or network outages.
+            # An IntegrityError may be caused when we insert a row
+            # into the database that already exists, such as two requests
+            # doing an insert-or-update. It may succeed if we try again.
+            if isinstance(exc_info[1], (Retry, DisconnectionError,
+                IntegrityError, TransactionRollbackError)):
+                return True
+
+            return False
+
+        # Reraise Retry exceptions ourselves rather than invoke
+        # our superclass handleException method, as it will log OOPS
+        # reports etc. This would be incorrect, as transaction retry
+        # is a normal part of operation.
+        if should_retry(exc_info):
             if request.supportsRetry():
                 # Remove variables used for counting ticks as this request is
                 # going to be retried.
@@ -456,9 +481,11 @@
             if isinstance(exc_info[1], Retry):
                 raise
             raise Retry(exc_info)
+
         superclass = zope.app.publication.browser.BrowserPublication
-        superclass.handleException(self, object, request, exc_info,
-                                   retry_allowed)
+        superclass.handleException(
+            self, object, request, exc_info, retry_allowed)
+
         # If it's a HEAD request, we don't care about the body, regardless of
         # exception.
         # UPSTREAM: Should this be part of zope,