rabbitmq-server startup timeouts differ between SysV and systemd

Bug #1874075 reported by Nicolas Bock
This bug affects 2 people
Affects                    Status        Importance  Assigned to   Milestone
rabbitmq-server (Debian)   Fix Released  Unknown
rabbitmq-server (Ubuntu)   Fix Released  Low         Nicolas Bock
  Xenial                   Fix Released  Low         Nicolas Bock
  Bionic                   Fix Released  Low         Nicolas Bock
  Eoan                     Won't Fix     Undecided   Unassigned
  Focal                    Fix Released  Low         Nicolas Bock
  Groovy                   Fix Released  Low         Nicolas Bock

Bug Description

The startup timeouts were recently adjusted and synchronized between the SysV and systemd startup files.

https://github.com/rabbitmq/rabbitmq-server-release/pull/129

The new startup files should be included in this package.

[Impact]

After starting the RabbitMQ server process, the startup script will wait for the server to start by calling `rabbitmqctl wait` and will time out after 10 s.

The startup time of the server depends on how quickly the Mnesia database becomes available and the server will time out after `mnesia_table_loading_retry_timeout` ms times `mnesia_table_loading_retry_limit` retries. By default this wait is 30,000 ms times 10 retries, i.e. 300 s.

The mismatch between these two timeout values might lead to the startup script failing prematurely while the server is still waiting for the Mnesia tables.
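
The arithmetic behind this mismatch can be sketched as follows. This is only an illustration using the default values quoted above; the constant and function names are made up for the example and are not part of RabbitMQ.

```python
# Default values quoted in the description; the names are illustrative only.
MNESIA_TABLE_LOADING_RETRY_TIMEOUT_MS = 30_000  # wait per retry
MNESIA_TABLE_LOADING_RETRY_LIMIT = 10           # number of retries
STARTUP_SCRIPT_TIMEOUT_S = 10                   # old `rabbitmqctl wait` timeout


def mnesia_wait_seconds(retry_timeout_ms: int, retry_limit: int) -> float:
    """Longest time the broker waits for Mnesia tables before giving up."""
    return retry_timeout_ms * retry_limit / 1000


def script_gives_up_first(script_timeout_s: float, broker_wait_s: float) -> bool:
    """True when the startup script times out while the broker still waits."""
    return script_timeout_s < broker_wait_s


broker_wait = mnesia_wait_seconds(
    MNESIA_TABLE_LOADING_RETRY_TIMEOUT_MS, MNESIA_TABLE_LOADING_RETRY_LIMIT
)
print(broker_wait)                                                   # 300.0
print(script_gives_up_first(STARTUP_SCRIPT_TIMEOUT_S, broker_wait))  # True
print(script_gives_up_first(600, broker_wait))                       # False
```

With the old 10 s script timeout the script fails long before the broker's 300 s wait ends; with a 600 s script timeout it no longer does.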

This change introduces variable `RABBITMQ_STARTUP_TIMEOUT` and the `--timeout` option into the startup script. The default value for this timeout is set to 10 minutes (600 seconds).

This change also updates the systemd service file to match the timeout values between the two service management methods.
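
As a rough sketch of how a wait helper could pick up the new variable: the function below reads `RABBITMQ_STARTUP_TIMEOUT` from the environment, falls back to the 600 s default, and builds the `rabbitmqctl wait --timeout` command line. The helper name and the pid file path are hypothetical; the actual startup scripts are shell, and this is not their implementation.

```python
import os

DEFAULT_STARTUP_TIMEOUT_S = 600  # default named in the description


def build_wait_command(pid_file, env=None):
    """Build the `rabbitmqctl wait` command line without running it.

    `pid_file` is a placeholder; pass whatever path the service really uses.
    """
    env = os.environ if env is None else env
    timeout = int(env.get("RABBITMQ_STARTUP_TIMEOUT", DEFAULT_STARTUP_TIMEOUT_S))
    return ["rabbitmqctl", "wait", "--timeout", str(timeout), pid_file]


# Hypothetical pid file path, for illustration only.
print(" ".join(build_wait_command("/path/to/rabbitmq.pid", env={})))
```

Passing `env` explicitly keeps the sketch easy to try out without touching the real environment.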

[Scope]

Upstream patch: https://github.com/rabbitmq/rabbitmq-server-release/pull/129

* Fix is not included in the Debian package
* Fix is not included in any Ubuntu series

* Groovy and Focal can apply the upstream patch as is
* Bionic and Xenial need an additional fix in the systemd service file
  to set the `RABBITMQ_STARTUP_TIMEOUT` variable for the
  `rabbitmq-server-wait` helper script.

[Test Case]

In a clustered setup with two nodes, A and B:

1. create queue on A
2. shut down B
3. shut down A
4. boot B

The broker on B will wait for A. The systemd service will wait for 10 seconds and then fail. Boot A and the rabbitmq-server process on B will complete startup.

[Regression Potential]

This change alters the behavior of the startup scripts when the Mnesia database takes a long time to become available. This might lead to failures further down the service dependency chain.

Changed in rabbitmq-server (Ubuntu):
assignee: nobody → Nicolas Bock (nicolasbock)
importance: Undecided → Low
Changed in rabbitmq-server (Ubuntu Eoan):
assignee: nobody → Nicolas Bock (nicolasbock)
Changed in rabbitmq-server (Ubuntu Bionic):
assignee: nobody → Nicolas Bock (nicolasbock)
Changed in rabbitmq-server (Ubuntu Xenial):
assignee: nobody → Nicolas Bock (nicolasbock)
Eric Desrochers (slashd)
description: updated
tags: added: sts
Eric Desrochers (slashd) wrote :

Thanks for your patch Nicolas.

As discussed, Focal (20.04 LTS) is transitioning from development to stable. If this fix is not release critical (which seems to be the case), we may need to wait a little bit.

We will sponsor right after the transition ends.

Meanwhile could you please:
* Add the SRU template that I cut/paste for you in the description above.
* Rework your changelog block following the feedback I have provided you.
* Produce debdiff(s) for all other impacted supported releases and attach them to the bug.

- Eric

Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "timeout.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Nicolas Bock (nicolasbock) wrote :

I have updated the description, Eric. Please have another look.

description: updated
Changed in rabbitmq-server (Ubuntu Eoan):
importance: Undecided → Low
Changed in rabbitmq-server (Ubuntu Bionic):
importance: Undecided → Low
Changed in rabbitmq-server (Ubuntu Xenial):
importance: Undecided → Low
Eric Desrochers (slashd)
Changed in rabbitmq-server (Ubuntu Xenial):
status: New → In Progress
Changed in rabbitmq-server (Ubuntu Bionic):
status: New → In Progress
Changed in rabbitmq-server (Ubuntu Eoan):
status: New → In Progress
Changed in rabbitmq-server (Ubuntu Focal):
status: New → In Progress
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Eric Desrochers (slashd) wrote :

Eoan FTBFS as follows:

==> rabbitmqctl
** (Mix) You're trying to run :rabbitmqctl on Elixir v1.9.1 but it has declared in its mix.exs file it supports only Elixir >= 1.6.6 and < 1.8.0
make[4]: *** [Makefile:93: escript/rabbitmqctl] Error 1
make[4]: Leaving directory '/<<PKGBUILDDIR>>/deps/rabbitmq_cli'
make[3]: *** [erlang.mk:4322: deps] Error 2
make[3]: Leaving directory '/<<PKGBUILDDIR>>/deps/rabbit'
make[2]: *** [erlang.mk:4322: deps] Error 2
make[2]: Leaving directory '/<<PKGBUILDDIR>>'
make[1]: *** [debian/rules:18: override_dh_auto_build] Error 2
make[1]: Leaving directory '/<<PKGBUILDDIR>>'
make: *** [debian/rules:11: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2

Bug reference:
https://bugs.launchpad.net/ubuntu/+source/rabbitmq-server/+bug/1773324

- Eric

Changed in rabbitmq-server (Ubuntu Eoan):
status: In Progress → Won't Fix
Eric Desrochers (slashd) wrote :

The rationale behind not fixing Eoan can be found in the LP: #1773324
Note that Eoan will reach EOL in July 2020.

- Eric

Nicolas Bock (nicolasbock) wrote :

Thanks Eric.

Nicolas Bock (nicolasbock) wrote :
Eric Desrochers (slashd)
Changed in rabbitmq-server (Ubuntu Eoan):
assignee: Nicolas Bock (nicolasbock) → nobody
importance: Low → Undecided
Eric Desrochers (slashd) wrote :

Nicolas, can you please produce a 'groovy' (20.10) debdiff?

Thanks

Dan Streetman (ddstreet)
tags: added: sts-sponsor-ddstreet
Nicolas Bock (nicolasbock) wrote :
description: updated
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Changed in rabbitmq-server (Debian):
status: Unknown → New
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Brian Murray (brian-murray) wrote :

The groovy upload of this has failed to build and the SRU won't be processed until the fix also exists in groovy.

Dan Streetman (ddstreet) wrote :

@james-page, it looks like elixir-lang in groovy has exceeded the limit that rabbitmq-server wants to build with; since our rabbitmq-server is newer than Debian, I'm not sure what your merge process from upstream is. Can you do another upstream merge please so rabbitmq-server is buildable in groovy? That should also pick up the fix for this bug, as well.

Thanks!

Changed in rabbitmq-server (Ubuntu Groovy):
assignee: Nicolas Bock (nicolasbock) → James Page (james-page)
James Page (james-page)
Changed in rabbitmq-server (Ubuntu Groovy):
assignee: James Page (james-page) → nobody
Dan Streetman (ddstreet) wrote :

For groovy, the upstream merge is being worked on in bug 1878049.

Changed in rabbitmq-server (Ubuntu Groovy):
assignee: nobody → Nicolas Bock (nicolasbock)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rabbitmq-server - 3.8.3-0ubuntu1

---------------
rabbitmq-server (3.8.3-0ubuntu1) groovy; urgency=medium

  * New upstream release:
    - d/watch: Fix watch file to download from GitHub

 -- Nicolas Bock <email address hidden> Wed, 13 May 2020 14:25:28 +0000

Changed in rabbitmq-server (Ubuntu Groovy):
status: In Progress → Fix Released
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Nicolas, or anyone else affected,

Accepted rabbitmq-server into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rabbitmq-server/3.8.2-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rabbitmq-server (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Łukasz Zemczak (sil2100) wrote :

Hello Nicolas, or anyone else affected,

Accepted rabbitmq-server into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rabbitmq-server/3.6.10-1ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rabbitmq-server (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Łukasz Zemczak (sil2100) wrote :

Hello Nicolas, or anyone else affected,

Accepted rabbitmq-server into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rabbitmq-server/3.5.7-1ubuntu0.16.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rabbitmq-server (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (rabbitmq-server/3.8.2-0ubuntu1.1)

All autopkgtests for the newly accepted rabbitmq-server (3.8.2-0ubuntu1.1) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

python-pika-pool/0.1.3-4 (s390x)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#rabbitmq-server

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Bryan Quigley (bryanquigley) wrote :

Is there a specific example command to run for step 1 of the test case, "create queue on A"?

Nicolas Bock (nicolasbock) wrote :

Bryan, I created https://github.com/nicolasbock/rabbitmq-test with some scripts you can use to test rabbitmq. Please let me know if you have any questions on how to use those.

Mauricio Faria de Oliveira (mfo) wrote :

Hi Bryan and Nicolas,

Just a friendly reminder that this bug is pending -proposed verification (X/B/F) before it can be released to -updates.

cheers,
Mauricio

Bryan Quigley (bryanquigley) wrote :

Sorry, meant to update the bug. It appears that just this SRU change is not enough and Charm changes are needed as well.

Mauricio Faria de Oliveira (mfo) wrote :

I guess if the SRU changes are good on their own,
they should be marked as verification done; then
the charm changes should also be looked at?

Or should the need for charm changes
invalidate these SRU contents?

Thanks!

Bryan Quigley (bryanquigley) wrote :

The current test case does not work for SRU verification. If there is a different test case that just targets the specific change, we could use that.

Nicolas Bock (nicolasbock) wrote :

I'll try to verify the SRU. The steps are the same as in the Test Case from the description. The original package will fail fairly quickly. The SRU should fail after 10 minutes.

Mauricio Faria de Oliveira (mfo) wrote :

Bryan, thanks for clarifying!

Nicolas, hey! :) I guess since you're familiar with the patch,
there might be a simpler test-case to check just the SRUs are
working as expected? And then expand on to the charm changes
that should rely on them (IIUIC)?

Thanks!

Nicolas Bock (nicolasbock) wrote :

I have a test env already set up Mauricio. I'll test it there.

Mauricio Faria de Oliveira (mfo) wrote :

Thanks Nicolas!

Sorry, I missed your comment that you'd verify the SRU
before posting; I hadn't refreshed the page. Oops :)

cheers!

Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :

I tested the bionic proposed and found that it was still missing a few things. I have updated and tested the package. The new debdiff is attached.

tags: added: verification-failed-bionic
removed: verification-needed-bionic
Dan Streetman (ddstreet) wrote :

Hi @nicolasbock, it looks like the latest debdiff is adding in a new config file, and that change doesn't appear to be upstream, or even in focal or groovy...can you clarify why that config file is needed and if you'll work to get it upstream before it goes into bionic?

Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :

Hi @ddstreet. By default rabbitmq-server will wait for 300 seconds before it gives up. Raising the timeout in the systemd service file is not sufficient to guarantee that rabbitmq-server will wait for 10 minutes. We can either add the new configuration file or add a retry to the systemd service file. However, the retry will only affect systemd and not sysv, the new configuration file will affect both.

I have added another debdiff (based on proposed) that does both.

Dan Streetman (ddstreet) wrote :

@nicolasbock as we talked about, you can't just throw new changes into Bionic without working on Focal/Groovy (and upstream) first.

Additionally, to address the actual change, I'm concerned you are adding a config file to get the mnesia and rabbitmq timeouts to match - that doesn't help anyone upstream, or using any other distro. The "correct" change would be to adjust the actual default instead of Ubuntu carrying a config file, right? Please have a talk with upstream to get the config defaults correctly matching, and then the additional fix should go to F/G before updating Bionic.

Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :
Nicolas Bock (nicolasbock) wrote :

Verified focal and groovy

tags: added: verification-done-focal
removed: verification-needed-focal
Łukasz Zemczak (sil2100) wrote :

Ok, I am slightly confused here right now. From the comments I see that the current SRUs are not enough for the bug to be fixed, is that correct? Especially that I see debdiffs for bionic *and* focal attached.

Also, with so many different comments and issues regarding verification, I think I would need a bit more detail regarding what testing has been performed and on which packages (with logs perhaps). Just a "verified focal and groovy" is too ambiguous. For now I'm switching it back to 'verification-needed'.

As for all the additional debdiffs - I assume those also need to be included for the fix to be complete, yes? Are those changes already present in groovy?

tags: added: verification-needed-focal
removed: verification-done-focal
Dan Streetman (ddstreet) wrote :

@sil2100, last time I talked to @nicolasbock he was unclear on why it was failing in bionic and passing in focal. I think probably the best thing to do at this point is reject the versions in -proposed for x/b/f, or just leave them in -proposed indefinitely, and he can take more time to come up with the proper patches and then re-request sponsorship. Sorry for the confusion here.

Matthew Ruffell (mruffell) wrote :

@ddstreet asked me to take a look.

Step-by-step reproducer:

Set up a 2 node rabbitmq cluster using virtual machines.
Make the hostnames rabbitmq1 and rabbitmq2.
Add each host to /etc/hosts in each vm.

Create the cluster:

1) On both hosts: sudo apt install rabbitmq-server
2) On host 1, copy the string in /var/lib/rabbitmq/.erlang.cookie and place it in /var/lib/rabbitmq/.erlang.cookie on host 2.
3) On host 2, restart the rabbitmq service: sudo systemctl restart rabbitmq-server
4) On host 2, stop the server: sudo rabbitmqctl stop_app
5) On host 2: sudo rabbitmqctl reset
6) On host 2: sudo rabbitmqctl join_cluster rabbit@rabbitmq1
7) On host 2: sudo rabbitmqctl start_app
8) On host 1: sudo rabbitmqctl cluster_status

You should see both rabbit hosts in the cluster.

Set up the queues:

On host 1:

1) sudo rabbitmqctl add_user tester linux
2) sudo rabbitmqctl add_vhost tester
3) sudo rabbitmqctl set_permissions -p tester tester ".*" ".*" ".*"
4) sudo rabbitmqctl set_policy -p tester HA ".*" '{"ha-mode": "all"}'
5) sudo rabbitmqctl list_permissions -p tester
6) sudo rabbitmqctl list_policies -p tester

On VM host:

1) git clone https://github.com/nicolasbock/rabbitmq-test.git
2) virtualenv venv
3) . venv/bin/activate
4) pip install -r requirements.txt
5) python setup.py install
6) ./test-rabbit.py <host 1 IP addr> --send 'message 1'
7) ./test-rabbit.py <host 1 IP addr> --list

On host 1:

1) sudo rabbitmqctl list_queues -p tester name pid slave_pids
Listing queues
test_queue <email address hidden> [<email address hidden>]

In my case, rabbitmq1 is the primary owner of the queue denoted with <>, with rabbitmq2 being a slave, denoted with [].

We want to shut down the primary host, so the slave gets promoted to being primary.

Shut down rabbitmq1.

On host 2, confirm it has become primary with:

1) sudo rabbitmqctl list_queues -p tester name pid slave_pids
Listing queues
test_queue <email address hidden> []

Send a new message, to push the queue ahead of what rabbitmq1 currently knows about.

On VM host:

1) ./test-rabbit.py <host 2 IP addr> --send 'message 2'
2) ./test-rabbit.py <host 2 IP addr> --list

Shut rabbitmq2 down, so all VMs are off.

Attempt to boot rabbitmq1 now. rabbitmq1 will assume it is behind, and needs to wait for rabbitmq2 to come online before we continue. This is where the issue occurs. Check the status of rabbitmq-server.service on rabbitmq1.

1) sudo systemctl status rabbitmq-server.service
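
To watch the unit state during this last step without reading the full status output, a small helper along these lines could be used. It is a sketch only: `unit_state` and `describe` are made-up names, and the interpretation of each state is an assumption about how this reproducer plays out, not output from the real packages.

```python
import subprocess


def unit_state(unit, run=subprocess.run):
    """Return the state word printed by `systemctl is-active <unit>`."""
    # check=False: `systemctl is-active` exits non-zero for any state
    # other than "active", which is expected here.
    result = run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def describe(state):
    """Map a unit state to what it likely means in this reproducer (assumed)."""
    if state == "activating":
        return "still waiting for the peer node"
    if state == "active":
        return "broker started"
    if state == "failed":
        return "startup wait timed out"
    return "unexpected state: " + state
```

For example, `describe(unit_state("rabbitmq-server.service"))` could be run on rabbitmq1 in a loop while rabbitmq2 is still powered off.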

Matthew Ruffell (mruffell) wrote :

What happens (packages from -updates):

On Groovy: Assuming same behaviour as focal due to systemd service file. Untested.

On Focal: The rabbitmq service will start, and stay in 'activating' mode until the daemon notifies systemd that it has started up (type=notify). Every 300 seconds / 5 minutes rabbitmq will log failure to synchronise the message queue until rabbitmq2 returns, but the daemon never dies. TimeoutStartSec=3600 or one hour, so daemon stays waiting for 1 hour, with it soft resetting every 5 minutes as queue synchronisation timeouts occur. Service will only change to 'active' when rabbitmq2 starts and the message queue is synced.

From what I understand, I don't think there are any problems on focal or groovy. As long as rabbitmq2 comes up within an hour, things work. Note because of this bug, groovy and upstream has now been changed to 10 min timeout, down from 1hr.

On Eoan: Assuming same behaviour as focal due to systemd service file. Untested.

On Bionic: The rabbitmq service will start, and runs an ExecStartPost script that waits on the rabbitmq daemon. If this ExecStartPost script times out (which it does after 90 seconds it seems, even though documentation suggests infinite timeout), it terminates with an error exit code, and since the unit is Type=simple, systemd marks the service as failed. There is no Restart=on-failure in Bionic's systemd unit, so rabbitmq stays dead. Rabbitmq dies 90 seconds after boot, and will never rejoin the cluster by itself. The machine needs to be power cycled, or someone has to ssh in and restart the rabbitmq service manually.

On Xenial: Assuming same behaviour as Bionic due to systemd service file. Untested.

Suggested actions:
For Bionic: From my understanding of the problem and my testing, I found that replacing the systemd service file with the one from focal, which changes type=simple to type=notify, with a 1hr timeout, and restart=on-failure solves the problem. Notes: I checked the source code, and rabbitmq in bionic does indeed support type=notify, although, we need to add a dependency to the package, socat. See below commit for details:

commit: 2d6383bade61fea0b8652b72d25bb1a9f0d6133f
From: Alexey Lebedeff <email address hidden>
Date: Fri, 11 Mar 2016 17:42:15 +0300
Subject: Improve systemd integration
Link: https://github.com/rabbitmq/rabbitmq-server/commit/2d6383bade61fea0b8652b72d25bb1a9f0d6133f

Github Issue for above commit: https://github.com/rabbitmq/rabbitmq-server/issues/664

Xenial: I need to dig into this. We will likely follow the same path as bionic, but we need to be careful to ensure service type=notify is sufficiently supported in rabbitmq 3.5.7 before we SRU the change. Will also likely need socat as a dependency and maybe a backport of the above commit.

Dan Streetman (ddstreet) wrote :

@mruffell thanks! Only a few comments below:

> Note because of this bug, groovy and upstream has now been changed to 10 min timeout, down from 1hr.

We should decide if this is really what we want to do. And if it should revert to the longer 1hr timeout, propose that upstream.

I don't really know, is either default timeout better, 10 minutes or 1 hour? @nicolasbock did you have specific reasoning for the upstream reduction to 10 minutes?

If 10 minutes is what we want, then we should be ok upstream and in Groovy, and in Focal with the current code in -proposed.

> On Eoan: Assuming same behaviour as focal due to systemd service file. Untested.

yeah, it FTBFS in Eoan unfortunately; there is bug 1843761, and also I detailed why it fails in the description for bug 1773324. As Eoan is almost EOL, my opinion is it's safer to simply leave it untouched there.

> If this ExecStartPost script times out (which it does after 90 seconds it seems, even though documentation suggests infinite timeout)

yep, systemd has DefaultTimeoutStartSec set to 90s (man systemd-system.conf for more details), so if TimeoutStartSec isn't specified for a service unit, it will default to 90 seconds (and I believe the timeout period includes the ExecStartPre, ExecStart, and ExecStartPost actions, but I'd have to specifically check the code to verify that).
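
The fallback described here can be illustrated with a small sketch that reads a unit file and reports the start timeout that would apply. The unit text is a simplified stand-in rather than the file shipped in Bionic (the ExecStart path in particular is assumed), the function name is made up, and only plain-second values such as `600` are handled.

```python
import configparser

SYSTEMD_DEFAULT_TIMEOUT_START_S = 90  # DefaultTimeoutStartSec, as noted above

# Simplified stand-in for a Bionic-style unit; not the shipped file.
BIONIC_LIKE_UNIT = """\
[Unit]
Description=RabbitMQ Messaging Server

[Service]
Type=simple
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
"""


def effective_start_timeout(unit_text):
    """Return TimeoutStartSec from the unit, or the 90 s default if unset."""
    parser = configparser.ConfigParser(
        strict=False,        # unit files may repeat keys such as ExecStartPost
        interpolation=None,  # do not treat '%' specifiers as interpolation
        delimiters=("=",),
    )
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    raw = parser.get("Service", "TimeoutStartSec", fallback=None)
    if raw is None:
        return SYSTEMD_DEFAULT_TIMEOUT_START_S
    return int(raw)  # assumes a bare number of seconds


print(effective_start_timeout(BIONIC_LIKE_UNIT))                          # 90
print(effective_start_timeout(BIONIC_LIKE_UNIT + "TimeoutStartSec=600\n"))  # 600
```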

> we need to add a dependency to the package, socat

well, this is usually a problem for SRU releases. Unfortunately, adding new deps for SRU releases causes 'sudo apt-get upgrade' to *not* upgrade any package that pulls in new (not currently installed) deps. While 'sudo apt upgrade' *does* pull in new deps, the ~ubuntu-sru team typically rejects adding new runtime deps to any SRU, without a very strong reason.

Instead of pulling the entire service file back into Bionic, I think it might be enough to only add 'TimeoutStartSec=600', which should cover the timeout for the ExecStart= and ExecStartPost= actions. It may be also worth adding the Restart=on-failure and RestartSec=10 params. Could you test with the TimeoutStartSec param in bionic to see if that's enough to SRU?

If pulling back only the TimeoutStartSec=600 param to Bionic works, that will hopefully be enough for Xenial, too.

Thanks!

Nicolas Bock (nicolasbock) wrote :

@mruffell thanks! Only a few comments below:

> > Note because of this bug, groovy and upstream has now been changed
> > to 10 min timeout, down from 1hr.
>
> We should decide if this is really what we want to do. And if it
> should revert to the longer 1hr timeout, propose that upstream.
>
> I don't really know, is either default timeout better, 10 minutes or
> 1 hour? @nicolasbock did you have specific reasoning for the
> upstream reduction to 10 minutes?

The 10 minutes were a compromise between the default of 5 minutes (for
rabbitmq-server) and what upstream thought was a very long wait of 1
hour.

> If 10 minutes is what we want, then we should be ok upstream and in
> Groovy, and in Focal with the current code in -proposed.

In the context of the issues we are seeing with charms the 10 minute
timeout should be sufficient.

Dan Streetman (ddstreet) wrote :

> In the context of the issues we are seeing with charms the 10 minute
> timeout should be sufficient.

right, but this isn't just changing rabbitmq-server used by charms, this is changing the behavior for *all* Ubuntu users of rabbitmq-server, as well as upstream. Since upstream did accept it, my *assumption* is yes, 10 minutes is a good default, but since mismatched timeouts is essentially the cause of this entire problem, I thought it was worth just re-checking again, to make sure we all thought about it carefully with *all users* in mind, before leaving it at that.

To poke the thought button further, note that since the upstream (and f/g) service files also have 'Restart=on-failure' set, and will go 10 minutes (as configured with the TimeoutStartSec=600 param), the service is *effectively* set to never, ever time out, since it will just restart itself each time it times out, and the StartLimitIntervalSec= and StartLimitBurst= limits will never be exceeded (they default to 10s and 5, respectively).

So, I suppose the effective result is that in F and later (including upstream), the service will wait forever, with restart-on-failure happening every 10 minutes, until it is able to start successfully. With that in mind, I don't think the actual TimeoutStartSec= setting makes any difference at all (as long as it's long enough to avoid reaching the restart StartLimit settings), besides controlling how often the service logs a failure and then restarts.
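
The start-limit reasoning above can be checked with a simplified model: a unit trips the limit only when more than StartLimitBurst starts fall within one StartLimitIntervalSec window. The function names are made up, the window comparison is an approximation of systemd's rate limiting, and the 10 s gap between attempts assumes a RestartSec=10 setting.

```python
def hits_start_limit(start_times, interval_s=10.0, burst=5):
    """True if more than `burst` starts fall inside one `interval_s` window."""
    times = sorted(start_times)
    for i in range(len(times) - burst):
        # times[i] .. times[i + burst] are burst + 1 consecutive starts.
        if times[i + burst] - times[i] < interval_s:
            return True
    return False


def restart_schedule(attempts, timeout_start_s=600.0, restart_s=10.0):
    """Start times when every attempt runs into the start timeout."""
    return [n * (timeout_start_s + restart_s) for n in range(attempts)]


print(hits_start_limit(restart_schedule(100)))  # False
print(hits_start_limit([0, 1, 2, 3, 4, 5]))     # True
```

Attempts spaced roughly ten minutes apart never come close to the default limit, whereas a tight crash loop with six starts in five seconds would.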

I guess this all means that 1) the version in focal-proposed is correct, and 2) the xenial and bionic versions need the addition of TimeoutStartSec=600 and Restart=on-failure to their service file, right? Is that all that's needed?

Nicolas Bock (nicolasbock) wrote :

> I guess this all means that 1) the version in focal-proposed is
> correct, and 2) the xenial and bionic versions need the addition of
> TimeoutStartSec=600 and Restart=on-failure to their service file,
> right? Is that all that's needed?

I agree with your assessment regarding Focal and Groovy. However, I
would like to verify this thesis for Bionic and Xenial to make sure
that rabbitmq-server and systemd behave that way there as well. Note
that the systemd service file on Bionic has an ExecStartPost calling a
wrapper script in addition to the ExecStart which wouldn't be necessary
if the rabbitmq-server behaved the same way it does in Focal and
Groovy. In addition I am uncertain why the service doesn't use
`Type=notify` as in the later versions. Rabbitmq-server-3.6.10
understands what `sd_notify` is and I would have thought that this
implies that we should be able to use `Type=notify` on Bionic.

Dan Streetman (ddstreet) wrote :

> In addition I am uncertain why the service doesn't use
> `Type=notify` as in the later versions. Rabbitmq-server-3.6.10
> understands what `sd_notify` is and I would have thought that this
> implies that we should be able to use `Type=notify` on Bionic

our goal is to make the version in Bionic work; not to make the version in Bionic identical to later versions. Unless there is a specific *need* to change the service type to notify, we don't *need* to do that.

Nicolas Bock (nicolasbock) wrote :

On bionic the rabbitmq-server process returns with exit code 0
regardless of whether it managed to start the server or not.
Presumably this behavior is the reason for the ExecStartPost. If I
change the service to type notify then systemd notices that the
ExecStart command fails to start and the ExecStartPost is not
necessary (it is never executed then). However the end result seems to
be the same --> Let's leave the type as simple.

When I add Restart=on-failure the call to `systemctl start
rabbitmq-server` returns with an error message, but the
rabbitmq-server service is restarted every 10 seconds anyway.

I don't know whether the non-blocking of `systemctl start` will cause
issues, but the cluster now recovers as soon as the other broker is
started.

The relevant changes to the service file are:

[Service]
TimeoutStartSec=600
Restart=on-failure
RestartSec=10
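
For anyone who wants to try these three settings locally before a fixed package is available, one option is a systemd drop-in rather than editing the packaged unit. This is only a sketch: the helper name `write_dropin` and the file name `override.conf` are arbitrary choices, and the package fix itself changes the shipped service file instead.

```shell
#!/bin/sh
# Sketch only: write the three settings as a systemd drop-in.
# write_dropin DIR creates DIR/override.conf; the file name is an arbitrary choice.
write_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
TimeoutStartSec=600
Restart=on-failure
RestartSec=10
EOF
}

# On a real host (as root), followed by `systemctl daemon-reload`:
# write_dropin /etc/systemd/system/rabbitmq-server.service.d
```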

Revision history for this message
Matthew Ruffell (mruffell) wrote :

I agree that the fixes in -proposed for Focal and Groovy are good as is, since it works nicely with type=notify on those systems.

I spent some time testing on Bionic. I can confirm that ExecStartPost exists for the very reason that Nicolas describes: RabbitMQ will return 0 regardless of whether the server started correctly or not.

When I removed ExecStartPost, and set

TimeoutStartSec=600
Restart=on-failure
RestartSec=10

and reproduced, once RabbitMQ did 10x 30000ms timeouts, it exited 0, and systemd assumed it was a clean and expected stop, and the service was stopped with ExitSuccess. The result is that the service dies after 5 minutes. It's an improvement over 90 seconds, but not quite the 10 min gold standard.
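
The 5 minute figure follows directly from the two Mnesia settings. A minimal sketch of the arithmetic, with a made-up function name:

```shell
#!/bin/sh
# Sketch only: total time RabbitMQ waits for Mnesia tables before giving up.
mnesia_wait_seconds() {
    timeout_ms=$1   # mnesia_table_loading_retry_timeout
    retries=$2      # mnesia_table_loading_retry_limit
    echo $(( timeout_ms * retries / 1000 ))
}

# Defaults quoted in this bug: 30000 ms, 10 retries.
mnesia_wait_seconds 30000 10
```

This prints 300, i.e. 300 seconds or 5 minutes.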

I then added ExecStartPost back in, and added the modification to the wrapper script that Nicolas put forward, aka /usr/lib/rabbitmq/bin/rabbitmq-server-wait:

/usr/lib/rabbitmq/bin/rabbitmqctl wait -t 600 $RABBITMQ_PID_FILE

I then went and reproduced. systemd now doesn't treat the service as started until it actually joins the cluster; instead, it is in the activating state while it is waiting for the cluster leader to turn up.

I left the VM in this activating state for quite some time. Each time the 10x 30000ms timeouts finish, the wrapper script exits with failure, and systemd restarts the service, and we go back to another 10x 30000ms cycle. From my testing it never stops after 2 rounds; instead, it goes forever, likely due to Restart=on-failure and RestartSec=10 being set.

This works wonderfully, and fixes the problem. I'm sorry I ever doubted your solution Nicolas.

While I do think the better solution is to set type=notify, on Bionic it would require the socat dependency to actually send the notification to systemd, and as Dan mentioned before, that is unacceptable for an SRU, and the apt upgrade reasoning makes sense. We can enjoy perfect systemd controlled resilience on focal onwards.

On Bionic, I think staying with type=simple is fine, and we just need to set

TimeoutStartSec=600
Restart=on-failure
RestartSec=10

and make the below change to /usr/lib/rabbitmq/bin/rabbitmq-server-wait

/usr/lib/rabbitmq/bin/rabbitmqctl wait -t 600 $RABBITMQ_PID_FILE
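
As a sketch of how the wrapper could take the timeout from `RABBITMQ_STARTUP_TIMEOUT` instead of hard-coding 600, the snippet below only builds and prints the command line so it can be inspected without a broker installed. The function name and the pid file fallback path are assumptions for illustration, not the packaged script.

```shell
#!/bin/sh
# Sketch only: build the wait command with an overridable timeout.
# RABBITMQ_STARTUP_TIMEOUT and RABBITMQ_PID_FILE come from the environment;
# the pid file fallback below is a placeholder path.
build_wait_command() {
    timeout=${RABBITMQ_STARTUP_TIMEOUT:-600}
    pid_file=${RABBITMQ_PID_FILE:-/var/run/rabbitmq/pid}
    echo "/usr/lib/rabbitmq/bin/rabbitmqctl wait -t $timeout $pid_file"
}

build_wait_command
```

A real wrapper would execute the command rather than echo it, so that its exit code reaches systemd.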

I don't think we need to change the default mnesia_table_loading_retry_limit or mnesia_table_loading_retry_timeout values, since if we set the systemd Restart=on-failure setting, once the wrapper script dies at the 5 minute timeout, the service will just be restarted and we begin anew.

I'll make a test package now, and see how it goes.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Okay, I made test packages for Bionic and Xenial based on the above:

The ppa is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp1874075-test

It contains (based off of -updates):
Xenial:
rabbitmq-server 3.5.7-1ubuntu0.16.04.2+lp1874075v20200629b1
Bionic:
rabbitmq-server 3.6.10-1ubuntu0.1+lp1874075v20200629b1

Debdiffs for the above builds are:
Xenial: https://paste.ubuntu.com/p/Jm8ZctJzny/
Bionic: https://paste.ubuntu.com/p/j6cBPzgWMD/

On Bionic:
When you install the test packages on both nodes and reboot them, then attempt to reproduce, the node which is attempting to rejoin the cluster will stay in the systemd activating state, and the wrapper script terminates after 5 minutes or 300 seconds, i.e. 10x 30000ms timeouts. When the wrapper script terminates, it terminates with an error exit code, and systemd restarts the service. This continues forever until the node joins the cluster, at which stage the systemd status turns active. Problem is fixed.

On Xenial:
When you install the test packages on both nodes and reboot them, then attempt to reproduce, the node which is attempting to rejoin the cluster will stay in the systemd activating state, and the wrapper script terminates after 60 seconds. This is much shorter than Bionic. When the wrapper script terminates, it terminates with an error exit code, and systemd restarts the service. This continues forever until the node joins the cluster, at which stage the systemd status turns active. Problem is fixed.

It seems the timeouts happen at the mercy of mnesia_table_loading_retry_limit and mnesia_table_loading_retry_timeout values, ignoring the -t 600 that we pass into 'rabbitmqctl wait'. Nicolas, it seems you are right, and that if we didn't want our services to restart every 60 (xenial) or 300 (bionic) seconds, we would need to adjust these timeouts. The problem is, we would have to introduce new configuration files to do this, which is normally frowned on when doing an SRU.
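
To get a feel for how often the service would restart during an outage, here is a simplified model of the restart cycle. It assumes each cycle is exactly the wrapper timeout plus RestartSec and ignores process startup time, so real counts will differ; the function name and the one hour outage are hypothetical.

```shell
#!/bin/sh
# Sketch only: how many times systemd restarts the unit before the peer node is back.
# wait_s    = seconds until the wrapper script gives up
# restart_s = RestartSec
# peer_up_s = seconds until the other node becomes reachable (hypothetical input)
restarts_before_peer() {
    wait_s=$1
    restart_s=$2
    peer_up_s=$3
    echo $(( peer_up_s / (wait_s + restart_s) ))
}

restarts_before_peer 300 10 3600   # Bionic-like cycle, peer back after one hour
restarts_before_peer 60 10 3600    # Xenial-like cycle, same outage
```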

Now that we have Restart=on-failure and RestartSec=10, should I add config to change mnesia_table_loading_retry_timeout? To be honest I am happy with leaving them as is, and just relying on Restart=on-failure to do its job. @ddstreet do you have any strong opinions? Is a service restarting every 60 seconds unacceptable until the node can rejoin the cluster?

Nicolas, can you install and test these packages and double check that you also see what I see. If everything is good, you can submit new debdiffs for Xenial and Bionic based on my ones, and we can get some new builds into -proposed.

Nicolas, I think you are more or less right all along, and all you were missing is Restart=on-failure and RestartSec=10 in the service file.

Revision history for this message
Nicolas Bock (nicolasbock) wrote :

@mruffell, I can confirm that the bionic package from your PPA is working.

Revision history for this message
Nicolas Bock (nicolasbock) wrote :

@mruffell, I can also confirm that the xenial package is working.

Revision history for this message
Dan Streetman (ddstreet) wrote :

Thanks @nicolasbock @mruffell! I think the latest changes make sense and I think the shorter wait time in Xenial is ok, as systemd should continue to restart it until it's successful.

I rebased @mruffell's debdiffs on the -proposed versions, and added a short changelog entry, and uploaded to x/b.

As the version in focal-proposed has been tested as working correctly, and we now understand what was missing from x/b, I'm going to re-mark this as verification-done-focal.

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Nicolas, or anyone else affected,

Accepted rabbitmq-server into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rabbitmq-server/3.6.10-1ubuntu0.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

tags: added: verification-needed-bionic
removed: verification-failed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Nicolas, or anyone else affected,

Accepted rabbitmq-server into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rabbitmq-server/3.5.7-1ubuntu0.16.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Dan Streetman (ddstreet) wrote :

Marking verification-done-focal as I mentioned in comment 71.

tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rabbitmq-server - 3.8.2-0ubuntu1.1

---------------
rabbitmq-server (3.8.2-0ubuntu1.1) focal; urgency=medium

  * Fix timeout discrepancy between SysV and systemd (LP: #1874075)
    upstream, rabbitmq-server-release - 694540270c8
  * d/rabbitmq-server.init
    - Add RABBITMQ_STARTUP_TIMEOUT and default to 600
  * d/rabbitmq-server.service
    - Default TimeoutStartSec to 600

 -- Nicolas Bock <email address hidden> Tue, 21 Apr 2020 06:37:55 -0600

Changed in rabbitmq-server (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for rabbitmq-server has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Performing verification for rabbitmq-server in -proposed for focal.

I know the focal package is released to -updates, but I will verify again regardless.

I installed 3.8.2-0ubuntu1.1 from -proposed, and followed the instructions I wrote in comment #59.

I sent a message to host 1, checked that host 1 was master, and host 2 was the slave. I shut host 1 down, and then sent a message to host 2. Host 2 became the new master. I shut host 2 down. I then started host 1 up again.

When I ran "sudo systemctl status rabbitmq-server.service", rabbit was in the "activating" state, due to having its systemd service as type=notify. I let it sit on "activating" for an entire 10 minutes, at which point it timed out as planned, and the service restarted as expected after 10 seconds. The service came up again in the "activating" state as expected. I then started host 2 up again, and after it had come up, both rabbitmq-server services were in the active state, and servicing requests.
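
Instead of re-running `systemctl status` by hand, the state can be polled in a loop. The following is a rough sketch with made-up function names; it relies on `systemctl is-active`, which prints states such as "activating", "active" or "failed".

```shell
#!/bin/sh
# Sketch only: poll a unit until systemd reports it as active.
unit_state() {
    # is-active exits non-zero for anything but "active"; only the printed state is needed.
    systemctl is-active "$1" 2>/dev/null || true
}

# wait_until_active UNIT MAX_ATTEMPTS INTERVAL_SECONDS
wait_until_active() {
    unit=$1
    max_attempts=$2
    interval=$3
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if [ "$(unit_state "$unit")" = "active" ]; then
            echo "$unit active after $attempt check(s)"
            return 0
        fi
        sleep "$interval"
        attempt=$((attempt + 1))
    done
    echo "$unit still not active after $max_attempts check(s)" >&2
    return 1
}

# Example: check every 30 s for up to 25 minutes.
# wait_until_active rabbitmq-server.service 50 30
```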

The package is behaving as expected on focal, and I am happy to mark this as verified.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Performing verification for rabbitmq-server in -proposed for bionic.

I installed 3.6.10-1ubuntu0.3 from -proposed, and followed the instructions I wrote in comment #59.

I sent a message to host 1, checked that host 1 was master, and host 2 was the slave. I shut host 1 down, and then sent a message to host 2. Host 2 became the new master. I shut host 2 down. I then started host 1 up again.

The rabbit service came up in an "activating" state, due to ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait kicking in. I looked at /var/log/rabbitmq/host1.log, and rabbit logged its attempts to sync with host 2 every 30 seconds, for 10 tries. At this point, /usr/lib/rabbitmq/bin/rabbitmq-server-wait exits with an error code since it has timed out, and systemd then restarts the service after a 10 second delay. This is all as expected. The service comes back up in the activating state and continues to wait for host 2. I left it like this for 15 minutes, and the rabbitmq-server-wait timeout, restart and waiting all work correctly.

I then started host 2 up, and host 1 rejoined the cluster. Both hosts have the rabbit service in the active state.

The package is behaving as expected on bionic, and I am happy to mark this as verified.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Performing verification for rabbitmq-server in -proposed for xenial.

I installed 3.5.7-1ubuntu0.16.04.4 from -proposed, and followed the instructions I wrote in comment #59.

I sent a message to host 1, checked that host 1 was master, and host 2 was the slave. I shut host 1 down, and then sent a message to host 2. Host 2 became the new master. I shut host 2 down. I then started host 1 up again.

The rabbit service came up in an "activating" state, due to ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait kicking in. The rabbitmq in xenial does not offer verbose logging in its /var/log/rabbitmq/rabbit.log file about waiting for host 2, so instead I focused on the systemd service. The service will time out after 60 seconds, something I noted and explained previously, which @ddstreet and I find to be an acceptable timeout. When the timeout occurs, /usr/lib/rabbitmq/bin/rabbitmq-server-wait exits with an error code, which causes systemd to restart the service, as expected. After a 10 second wait, the service starts in the activating state, waiting for host 2 to come up. I left the service for 15 minutes, and it timed out and restarted successfully 15 times.

I then started host 2 up, and host 1 rejoined the cluster. Both hosts have the rabbit service in the active state.

The package is behaving as expected on xenial, and I am happy to mark this as verified.

tags: added: verification-done-xenial
removed: verification-needed verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rabbitmq-server - 3.6.10-1ubuntu0.3

---------------
rabbitmq-server (3.6.10-1ubuntu0.3) bionic; urgency=medium

  * d/rabbitmq-server.service:
    - add TimeoutStartSec, Restart, RestartSec parameters (LP: #1874075)

rabbitmq-server (3.6.10-1ubuntu0.2) bionic; urgency=medium

  * Fix timeout discrepancy between SysV and systemd (LP: #1874075)
    upstream, rabbitmq-server-release - 694540270c8
  * d/rabbitmq-server.init
    - Add RABBITMQ_STARTUP_TIMEOUT and default to 600
  * d/rabbitmq-env.conf
    - Default TimeoutStartSec to 600
  * d/rabbitmq-server-wait
    - Use value of RABBITMQ_STARTUP_TIMEOUT in wait

 -- Matthew Ruffell <email address hidden> Mon, 29 Jun 2020 15:07:53 +1200

Changed in rabbitmq-server (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rabbitmq-server - 3.5.7-1ubuntu0.16.04.4

---------------
rabbitmq-server (3.5.7-1ubuntu0.16.04.4) xenial; urgency=medium

  * d/rabbitmq-server.service:
    - add TimeoutStartSec, Restart, RestartSec parameters (LP: #1874075)

rabbitmq-server (3.5.7-1ubuntu0.16.04.3) xenial; urgency=medium

  * Fix timeout discrepancy between SysV and systemd (LP: #1874075)
    upstream, rabbitmq-server-release - 694540270c8
  * d/rabbitmq-server.init
    - Add RABBITMQ_STARTUP_TIMEOUT and default to 600
  * d/rabbitmq-env.conf
    - Default TimeoutStartSec to 600
  * d/rabbitmq-server-wait
    - Use value of RABBITMQ_STARTUP_TIMEOUT in wait

 -- Matthew Ruffell <email address hidden> Mon, 29 Jun 2020 15:43:21 +1200

Changed in rabbitmq-server (Ubuntu Xenial):
status: Fix Committed → Fix Released
Dan Streetman (ddstreet)
tags: added: sts-sponsor
removed: sts-sponsor-ddstreet
Dan Streetman (ddstreet)
tags: removed: sts-sponsor
Changed in rabbitmq-server (Debian):
status: New → Fix Released