postgresql charm race condition in allowed-units

Bug #1273636 reported by Diogo Matsubara
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
postgresql (Juju Charms Collection)
Fix Released
Critical
Unassigned

Bug Description

I'm working on the test-catalog charm to support postgresql in preparation for the move of the OIL infrastructure to prodstack and the test-catalog relation with postgresql is triggering a race condition.

The charms used are: lp:charms/postgresql (revno 82) and lp:~matsubara/charms/precise/test-catalog/swap-to-postgresql and the merge proposal is here: https://code.launchpad.net/~matsubara/charms/precise/test-catalog/swap-to-postgresql/+merge/203307

Steps to reproduce:

juju deploy --repository . local:test-catalog
juju deploy --repository . local:postgresql
juju add-relation test-catalog:db postgresql:db

At this point test-catalog's db-relation-changed fails with the following error: http://pastebin.ubuntu.com/6831296/

Using juju from my laptop to canonistack, I can't reproduce this error all the time, most of the time the db-relation hooks work as intended.

Using juju from bastion (a machine inside the DC) and deploying against serverstack, I can reliably reproduce the error. If I add the following patch: http://pastebin.ubuntu.com/6831342/ to test-catalog, then most of the time I can get the db-relation hooks to just work as intended (i.e. the test-catalog-database script correctly initializes the db in PG).

Any ideas on how to debug this further? Is the test-catalog changes in the MP just plain wrong? In theory (and based on the doc for the postgresql charm) allowed-units is the right way to verify the database is ready for access from related units.

Related branches

description: updated
Revision history for this message
Stuart Bishop (stub) wrote :

The client code looks fine. It will only work with a single unit PostgreSQL service, but that is what is being tested here. 'hostname' is unnecessarily being set in relation-joined, but that should be harmless.

To debug this, we either need to step through or grab the logs from the server side. In particular, the charm unit-postgresql-0.log and /var/log/postgresql/postgresql-9.1-main.log.

What should be happening in the server's db-relation-changed hook:

 - pg_hba.conf rewritten. Authentication credentials and allowed-units set on the relation
 - PostgreSQL reloaded or restarted

The client should not see the relation data change until after the server's db-relation-changed hook has completed, so there should be no race condition between relation-set publishing allowed-units and the PostgreSQL reload. I think what must be happening is a logic error and PostgreSQL not being reloaded, or the reload is failing (eg. bad syntax in a config file), or the reload is not taking effect immediately.

Stuart Bishop (stub)
Changed in charms:
status: New → Triaged
importance: Undecided → Critical
status: Triaged → Incomplete
Revision history for this message
Stuart Bishop (stub) wrote :

Is this with the MaaS provider? We also need to confirm that the private-address on the client's relation matches what PostgreSQL actually sees when a connection attempt is made. It might be a provider specific bug like Bug #1250435 (https://<email address hidden>/msg01373.html)

Revision history for this message
Stuart Bishop (stub) wrote :

Looking at the provided logs, it appears that the private-address matches the IP address being refused access.

2014-01-28 01:37:15 INFO worker.uniter.jujuc server.go:102 running hook tool "relation-set" ["hostname=10.5.0.7" "data
base=testcatalog"]
2014-01-28 01:37:21 INFO juju.worker.uniter context.go:323 HOOK FATAL: no pg_hba.conf entry for host "10.5.0.7", user
 "db_18_test_catalog", database "testcatalog", SSL off

I think this discounts IP address mismatch problems I mentioned in comment#2

Revision history for this message
Diogo Matsubara (matsubara) wrote :

Hi Stuart ,

thanks for helping with this. I'm attaching the requested logs plugs the unit-test-catalog-0.log. I reproduced the bug again to grab the logs so the timestamps will be different than the ones in the logs provided in the original report.

Revision history for this message
Diogo Matsubara (matsubara) wrote :
Revision history for this message
Diogo Matsubara (matsubara) wrote :
Revision history for this message
Diogo Matsubara (matsubara) wrote :

Oh, and to answer your question in comment #2, this is not with the MAAS provider. This happened in the serverstack cloud which runs openstack (but I'm not sure which version of openstack)

Stuart Bishop (stub)
Changed in charms:
status: Incomplete → Triaged
Revision history for this message
Stuart Bishop (stub) wrote :

I discovered two different methods of issuing a 'reload', and I think what might be happening is the configuration reload is sometimes failing, but being repaired in a subsequent hook. This would create a window of time where the auth configuration is out of date.

I've been refactoring the service startup, stop & reload functionality. There is a reasonable chance it fixes the problem, and if not, at least improves the logging in this area and we can trace it further.

lp:~stub/charms/postgresql/pg93 contains the updates.

Revision history for this message
Stuart Bishop (stub) wrote :
Revision history for this message
Diogo Matsubara (matsubara) wrote :

Hi Stuart, I retried with your branch. Here are the logs.

Revision history for this message
Diogo Matsubara (matsubara) wrote :
Revision history for this message
Diogo Matsubara (matsubara) wrote :
Revision history for this message
Diogo Matsubara (matsubara) wrote :

More debugging with Stuart, here are the findings:

<stub> matsubara: Gah. You did this deliberately. The race is hard to spot, because the generated database name was test-catalog, but the one your charm is testcatalog
<stub> matsubara: Don't worry about fsync, I've identified the race.
<matsubara> stub, well, the charm did allow us to change the db name, so I'm working with the defined interface
<stub> https://pastebin.canonical.com/104024/
<stub> matsubara: yes. What you are doing is supposed to work, and even works in my test suite. But I can't guarantee the order of hooks in my test suite.
<stub> So the server -joined hook runs first, and grants access to the client to the generated database. The server doesn't yet know that the client wants a different database name.
<stub> Then the client sets the database name in its joined hook
<stub> And then runs its -changed hook, which attempts to connect to the database it wants to use.
<stub> And this fails because the server is still in the process of creating the requested database.
<stub> And this would be obvious if the database names were not so similar ;)
<matsubara> stub, so, should I change the test-catalog naming convention or is this something that you can change in the PG charm to deal with? I guess others might run in the same problem if they want to set the db name to something else.
<stub> matsubara: It is a bug in the PostgreSQL charm.
<stub> I'm trying to think of the best way to fix the race
<matsubara> stub, nasty bug, glad I exposed it before others start to use it more heavily. I'll update the bug report with the IRC log of the finding and let you work. Thanks a lot for your help!

Revision history for this message
Stuart Bishop (stub) wrote :

I think everything will work as it should if the client checks that relation_get('database') matches the expected database name before proceeding. This is additional burden on the client, but I don't think it can be avoided.

@hook()
def db_relation_changed():
    if local_unit() not in relation_get('allowed-units').split():
        return # Unit not yet authorized
    if config('database') !- relation_get('database'):
        return # Database name change not yet in effect

   [... do yo thang ...]

Revision history for this message
Diogo Matsubara (matsubara) wrote :

Hi Stuart, I did the changes you suggested and they work like a charm (doh, bad pun, I know :-)). One thing though, relation_get() might return None so the split() call would fail but I understand the general meaning of your suggestion. Thanks for helping with this issue!

Stuart Bishop (stub)
Changed in charms:
status: Triaged → In Progress
Marco Ceppi (marcoceppi)
affects: charms → postgresql (Juju Charms Collection)
Stuart Bishop (stub)
Changed in postgresql (Juju Charms Collection):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.