rabbitmq services may not be started when add_vhost is attempted

Bug #1500552 reported by Ryan Beisner
24
This bug affects 5 people
Affects Status Importance Assigned to Milestone
OpenStack RabbitMQ Server Charm
Expired
Medium
Unassigned
rabbitmq-server (Juju Charms Collection)
Invalid
High
Unassigned

Bug Description

I've observed this in Trusty-Kilo test targets with a 3-node native cluster. I believe there is a race where the add_vhost is called, and the rabbitmq service may not be running on the unit.

2015-09-16 19:10:45 DEBUG worker.uniter.jujuc server.go:159 hook context id "rabbitmq-server/0-cluster-relation-changed-2445711715711195588"; dir "/var/lib/juju/agents/unit-rabbitmq-server-0/charm"
2015-09-16 19:10:45 INFO juju-log cluster:1: /usr/bin/rsync -r --delete --executability /var/lib/juju/agents/unit-rabbitmq-server-0/charm/scripts/collect_rabbitmq_stats.sh /usr/local/bin/collect_rabbitmq_stats.sh
2015-09-16 19:10:45 INFO worker.uniter.jujuc server.go:158 running hook tool "juju-log" ["Writing file /etc/cron.d/rabbitmq-stats root:root 444"]
2015-09-16 19:10:45 DEBUG worker.uniter.jujuc server.go:159 hook context id "rabbitmq-server/0-cluster-relation-changed-2445711715711195588"; dir "/var/lib/juju/agents/unit-rabbitmq-server-0/charm"
2015-09-16 19:10:45 INFO juju-log cluster:1: Writing file /etc/cron.d/rabbitmq-stats root:root 444
2015-09-16 19:10:45 INFO worker.uniter.jujuc server.go:158 running hook tool "relation-ids" ["--format=json" "nrpe-external-master"]
2015-09-16 19:10:45 DEBUG worker.uniter.jujuc server.go:159 hook context id "rabbitmq-server/0-cluster-relation-changed-2445711715711195588"; dir "/var/lib/juju/agents/unit-rabbitmq-server-0/charm"
2015-09-16 19:10:45 INFO cluster-relation-changed Error: rabbit application is not running on node rabbit@juju-osci-sv05-machine-2.
2015-09-16 19:10:45 INFO cluster-relation-changed * Suggestion: start it with "rabbitmqctl start_app" and try again
2015-09-16 19:10:45 INFO worker.uniter.jujuc server.go:158 running hook tool "juju-log" ["-l" "DEBUG" "Command '['/usr/sbin/rabbitmqctl', 'list_vhosts']' returned non-zero exit status 2"]
2015-09-16 19:10:45 DEBUG worker.uniter.jujuc server.go:159 hook context id "rabbitmq-server/0-cluster-relation-changed-2445711715711195588"; dir "/var/lib/juju/agents/unit-rabbitmq-server-0/charm"
2015-09-16 19:10:45 DEBUG juju-log cluster:1: Command '['/usr/sbin/rabbitmqctl', 'list_vhosts']' returned non-zero exit status 2
2015-09-16 19:10:46 INFO cluster-relation-changed Error: rabbit application is not running on node rabbit@juju-osci-sv05-machine-2.
2015-09-16 19:10:46 INFO cluster-relation-changed * Suggestion: start it with "rabbitmqctl start_app" and try again
2015-09-16 19:10:46 INFO cluster-relation-changed Traceback (most recent call last):
2015-09-16 19:10:46 INFO cluster-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/cluster-relation-changed", line 810, in <module>
2015-09-16 19:10:46 INFO cluster-relation-changed hooks.execute(sys.argv)
2015-09-16 19:10:46 INFO cluster-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/core/hookenv.py", line 672, in execute
2015-09-16 19:10:46 INFO cluster-relation-changed self._hooks[hook_name]()
2015-09-16 19:10:46 INFO cluster-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/cluster-relation-changed", line 362, in cluster_changed
2015-09-16 19:10:46 INFO cluster-relation-changed update_nrpe_checks()
2015-09-16 19:10:46 INFO cluster-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/cluster-relation-changed", line 575, in update_nrpe_checks
2015-09-16 19:10:46 INFO cluster-relation-changed rabbit.create_vhost(vhost)
2015-09-16 19:10:46 INFO cluster-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 106, in create_vhost
2015-09-16 19:10:46 INFO cluster-relation-changed subprocess.check_call(cmd)
2015-09-16 19:10:46 INFO cluster-relation-changed File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
2015-09-16 19:10:46 INFO cluster-relation-changed raise CalledProcessError(retcode, cmd)
2015-09-16 19:10:46 INFO cluster-relation-changed subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'add_vhost', 'nagios-rabbitmq-server-0']' returned non-zero exit status 2
2015-09-16 19:10:46 INFO juju.worker.uniter.context context.go:543 handling reboot
2015-09-16 19:10:46 ERROR juju.worker.uniter.operation runhook.go:103 hook "cluster-relation-changed" failed: exit status 1
2015-09-16 19:10:46 DEBUG juju.worker.uniter modes.go:31 [AGENT-STATUS] failed: run relation-changed (1; rabbitmq-server/2) hook
2015-09-16 19:10:46 INFO juju.worker.uniter modes.go:543 ModeAbide exiting
2015-09-16 19:10:46 INFO juju.worker.uniter modes.go:541 ModeHookError starting
2015-09-16 19:10:46 DEBUG juju.worker.uniter.filter filter.go:597 want resolved event
2015-09-16 19:10:46 DEBUG juju.worker.uniter.filter filter.go:591 want forced upgrade true
2015-09-16 19:10:46 DEBUG juju.worker.uniter.filter filter.go:727 no new charm event
2015-09-16 19:10:46 DEBUG juju.worker.uniter modes.go:31 [AGENT-STATUS] error: hook failed: "cluster-relation-changed"
2015-09-16 19:10:49 DEBUG juju.worker.uniter.filter filter.go:509 got leader settings change: ok=true

Related branches

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Changed in rabbitmq-server (Juju Charms Collection):
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 15.10
importance: Medium → High
Changed in rabbitmq-server (Juju Charms Collection):
assignee: nobody → Jorge Niedbalski (niedbalski)
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

@beisner,

Could you upload the full unit log?

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

There are two evident anomalies on this case:

1) The update_nrpe_checks method is being triggered even if no "nrpe-external-master" relation exists, which is wrong. This
call should be conditional to the existence of that relation.

2) The create_vhost method doesn't have a retry mechanism in case of failure.

I still need the full logs from the affected unit to see any other important detail, while I am trying
to reproduce this issue.

For the 2 before mentioned situations I linked an MP.

tags: added: sts
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

After reviewing the full logs (available on http://10.245.162.77:8080/job/charm_amulet_test/6470/).

From juju-osci-sv05-machine-4

The cluster is sane:

2015-09-16 19:10:39 INFO juju-log cluster:1: cluster status is Cluster status of node 'rabbit@juju-osci-sv05-machine-4' ...
[{nodes,[{disc,['rabbit@juju-osci-sv05-machine-4']}]},
 {running_nodes,['rabbit@juju-osci-sv05-machine-4']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]

At this point it tries to cluster with juju-osci-sv05-machine-2

2015-09-16 19:10:40 INFO juju-log cluster:1: Clustering with remote rabbit host (juju-osci-sv05-machine-2).

At this point the node rabbit@juju-osci-sv05-machine-4, stops itself.

2015-09-16 19:10:40 INFO cluster-relation-changed Stopping node 'rabbit@juju-osci-sv05-machine-4' ...

Then the reported issue here is a counter effect of the same.

2015-09-16 19:10:41 INFO worker.uniter.jujuc server.go:158 running hook tool "juju-log" ["Failed to cluster with juju-osci-sv05-machine-2."]

So, it seems that this is failing on hooks/rabbit_utils:cluster_with method on the following call (Line 302)

        try:
            cmd = [RABBITMQ_CTL, 'stop_app']
            subprocess.check_call(cmd)
            cmd = [RABBITMQ_CTL, cluster_cmd, 'rabbit@%s' % node]
            try:
                subprocess.check_output(cmd, stderr=subprocess.STDOUT)
            except subprocess.CalledProcessError as e:
                if not e.returncode == 2 or \
                        "{ok,already_member}" not in e.output:
                    raise e

So, at this point this exception is being catch'd and this is going unnoticed, everything from here
such as add_vhost , etc. Will fail since the unit is stopped.

Changed in rabbitmq-server (Juju Charms Collection):
status: Confirmed → In Progress
James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
milestone: 15.10 → 16.01
Changed in rabbitmq-server (Juju Charms Collection):
assignee: Jorge Niedbalski (niedbalski) → nobody
James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
milestone: 16.01 → 16.04
James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
milestone: 16.04 → 16.07
Liam Young (gnuoy)
Changed in rabbitmq-server (Juju Charms Collection):
milestone: 16.07 → 16.10
James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
milestone: 16.10 → 17.01
Revision history for this message
Francis Ginther (fginther) wrote :
Download full text (4.0 KiB)

Looks like we hit this with a newton deployment from landscape autopilot.

[from unit-rabbitmq-server-1.log]
2017-01-10 14:24:39 INFO config-changed Error: unable to connect to node 'rabbit@juju-b03454-1-lxd-1': nodedown
2017-01-10 14:24:39 INFO config-changed
2017-01-10 14:24:39 INFO config-changed DIAGNOSTICS
2017-01-10 14:24:39 INFO config-changed ===========
2017-01-10 14:24:39 INFO config-changed
2017-01-10 14:24:39 INFO config-changed attempted to contact: ['rabbit@juju-b03454-1-lxd-1']
2017-01-10 14:24:39 INFO config-changed
2017-01-10 14:24:39 INFO config-changed rabbit@juju-b03454-1-lxd-1:
2017-01-10 14:24:39 INFO config-changed * connected to epmd (port 4369) on juju-b03454-1-lxd-1
2017-01-10 14:24:39 INFO config-changed * epmd reports: node 'rabbit' not running at all
2017-01-10 14:24:39 INFO config-changed no other nodes on juju-b03454-1-lxd-1
2017-01-10 14:24:39 INFO config-changed * suggestion: start the node
2017-01-10 14:24:39 INFO config-changed
2017-01-10 14:24:39 INFO config-changed current node details:
2017-01-10 14:24:39 INFO config-changed - node name: 'rabbitmq-cli-13458@juju-b03454-1-lxd-1'
2017-01-10 14:24:39 INFO config-changed - home dir: /var/lib/rabbitmq
2017-01-10 14:24:39 INFO config-changed - cookie hash: PQqnLl4zaYTm8d9gVBRhRQ==
2017-01-10 14:24:39 INFO config-changed
2017-01-10 14:24:39 INFO config-changed Traceback (most recent call last):
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/config-changed", line 706, in <module>
2017-01-10 14:24:39 INFO config-changed hooks.execute(sys.argv)
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/charmhelpers/core/hookenv.py", line 715, in execute
2017-01-10 14:24:39 INFO config-changed self._hooks[hook_name]()
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/rabbit_utils.py", line 777, in wrapped_f
2017-01-10 14:24:39 INFO config-changed f(*args, **kwargs)
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/charmhelpers/contrib/hardening/harden.py", line 79, in _harden_inner2
2017-01-10 14:24:39 INFO config-changed return f(*args, **kwargs)
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/config-changed", line 659, in config_changed
2017-01-10 14:24:39 INFO config-changed update_nrpe_checks()
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/config-changed", line 554, in update_nrpe_checks
2017-01-10 14:24:39 INFO config-changed rabbit.create_vhost(vhost)
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/rabbit_utils.py", line 190, in create_vhost
2017-01-10 14:24:39 INFO config-changed rabbitmqctl('add_vhost', vhost)
2017-01-10 14:24:39 INFO config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/rabbit_utils.py", line 338, in rabbitmqctl
2017-01-10 14:24:39 INFO config-changed subprocess.check_call(cmd)
2017-...

Read more...

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Reclassified to 'new' as this bug went unassigned. Need to reproduce and possibly rebase/refactor the previous bug owner's merge proposal as a gerrit review so that it can be reviewed and exercised in a current context.

Changed in rabbitmq-server (Juju Charms Collection):
status: In Progress → New
James Page (james-page)
Changed in charm-rabbitmq-server:
importance: Undecided → High
Changed in rabbitmq-server (Juju Charms Collection):
status: New → Invalid
James Page (james-page)
Changed in charm-rabbitmq-server:
importance: High → Medium
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack rabbitmq-server charm because there has been no activity for 60 days.]

Changed in charm-rabbitmq-server:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.